Preface



This book has its origins in the Multimedia Communications Research Cen- 
ter at Bell Laboratories. It will likely be the last thing coming from this Center, 
since it no longer exists. This book reflects our vision on next-generation mul- 
timedia communication systems, shared by many researchers who worked with 
us at one time or another over the last several years in the Bell Labs’ Acoustics and Speech
Research Department. Before falling apart in the recent relentless telecom- 
munication firestorm, this department had a legendary history of producing 
ground-breaking technological advances that have contributed greatly to the 
success of fundamental voice and advanced multimedia communications tech- 
nologies. Its scientific discoveries and innovative inventions made a remarkable 
impact on the cultures within both the industrial and academic communities. 
But due to the declining financial support from its parent company, Lucent
Technologies, Bell Labs decided to lower, if not completely stop, the effort on 
developing terminal technologies for multimedia communications. As a result, 
many budding and well established acoustics and speech scientists left Bell 
Labs but soon resumed their painstaking and rewarding research elsewhere. 
Their spirit of innovation and passion for what they believe in did not disappear;
quite to the contrary, it will likely live on (this book is the best evidence). The
“flame” of their curiosity and enthusiasm will keep burning and will be trans- 
mitted to the next generation without a doubt. The idea of editing such a book 
was triggered early in 2003 when we discussed how to continue our ongoing 
collaboration after we could no longer work together in the same place. By 
inviting our former colleagues and collaborators to share their thoughts and 
state-of-the-art research results, we hope to bring the readers to the very fron- 
tiers of audio signal processing for next-generation multimedia communication 
systems. 

We deeply appreciate the efforts, interest, and enthusiasm of all of the con- 
tributing authors. Thanks to their cooperation, editing this book turned out to 
be a very pleasant experience for us. We are grateful to our former department 




xii Audio Signal Processing 



head, Dr. Biing-Hwang (Fred) Juang (currently Motorola Foundation Chair Pro- 
fessor at the Georgia Institute of Technology) for his role in encouraging our 
research through the department’s most difficult time. Without his inspiration, 
this book would have never been published. We would also like to thank our 
former colleagues, Dr. Mohan Sondhi, Dr. James West, Dr. Steven Gay, Dr.
Dennis Morgan, Dr. Frank Soong, and many others, for their friendship and 
support over the years. For the preparation of the book, we would like to thank 
Alex Greene, our production editor, and Melissa Sullivan at Kluwer Academic 
Publishers. 



Yiteng (Arden) Huang, Jacob Benesty 




Contributing Authors 



Robert Aichner 

University of Erlangen-Nuremberg 

Carlos Avendano 

Creative Advanced Technology Center 

Jacob Benesty 

Universite du Quebec, INRS-EMT 

Herbert Buchner 

University of Erlangen-Nuremberg 

Jingdong Chen 

Bell Labs, Lucent Technologies

Eric J. Diethorn

Avaya Labs

Gary W. Elko

Avaya Labs

Volker Fischer 

Darmstadt University of Technology 

Tomas Gansler 

Agere Systems 







Yiteng (Arden) Huang 

Bell Labs, Lucent Technologies 

Walter Kellermann 

University of Erlangen-Nuremberg 



Achim Kuntz 

University of Erlangen-Nuremberg 

Jens Meyer 

MH Acoustics 



Rudolf Rabenstein 

University of Erlangen-Nuremberg 



Markus Rupp 

Vienna University of Technology 



Gerald Schuler 

Fraunhofer AEMT 

Sascha Spors 

University of Erlangen-Nuremberg 



Heinz Teutsch 

University of Erlangen-Nuremberg 




Chapter 1 



INTRODUCTION 



Yiteng (Arden) Huang 

Bell Laboratories, Lucent Technologies 

arden@research.bell-labs.com



Jacob Benesty 

Universite du Quebec, INRS-EMT 

benesty@inrs-emt.uquebec.ca

1. MULTIMEDIA COMMUNICATIONS 

Modern communications technology has become a part of our daily experience
and has dramatically changed the way we live, receive education, work,
and relate to each other. Communications are now fundamental
to the smooth functioning of contemporary society and our individual lives.
The expeditious growth in our ability to communicate was one of the most rev- 
olutionary advancements in our society over the last century, particularly the 
last two decades. 

In the recent progress of the communications revolution, we observe four technical
developments that have altered the entire landscape of the telecommunication
marketplace. The first is the rapid growth of data transfer rates. Thanks to
technological breakthroughs in fiber optics, the data transfer rate was boosted 
from around 100 Mbps (Megabits per second) for one single fiber at the be- 
ginning of the 1980s to today’s 400 Gbps (Gigabits per second). The capacity 
of optical fiber has increased by a factor of four thousand in just twenty years, 
exceeding Moore’s law (which describes the growth in the number of transistors
per square inch on integrated circuits). The second is the ubiquity of
packet switched networks driven by the ever-growing popularity of the Internet 
and the World-Wide Web. The invention of the Internet and the World-Wide 
Web created a common platform for us to share highly diverse information in 
a relatively unified manner. Ubiquitous packet switched networks make the 
world more intertwined than ever from an economical, cultural, and even polit- 
ical perspective. Compared to conventional circuit switched networks, packet 
switched networks are more cost effective and more efficient. Furthermore,
adding new services and applications is obviously easier and more flexible 
on packet switched networks than on circuit switched networks. As a result, 
the convergence of circuit and packet switched networks has emerged and the 
trend will continue into the future. The third is the wide deployment of wireless 
communications. A decade ago most people barely knew of wireless personal 
communications. In those days, it was basically a techie’s vision. But today, 
it is no longer an electrical engineer’s dream. Wireless communications has 
been well accepted by the public and its business is growing bigger every day 
and everywhere. The wireless communications technology has evolved from 
first-generation analog systems to second-generation digital systems, and will 
continue advancing into its third generation, which is optimized for both voice 
and data communication services. The last but not least significant develop- 
ment is the escalating demand for broadband access through such connections 
as digital subscriber line (DSL) or cable to the Internet. Broadband access 
enables a large number of prospective bandwidth-consuming services that will 
potentially make our work more productive and our leisure more rewarding. 
These developments are shaking the foundations of the telecommunications 
industry and it can be foreseen that tomorrow’s communications will be carried 
out over fast, high-capacity packet-switched networks with mobile, broadband 
access anytime and anywhere. 
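
The growth figures quoted above for optical fiber can be checked with a few lines of arithmetic. The following is only a rough sketch: the rates and the twenty-year span are taken from the text, while the two-year doubling period used for Moore's law is an assumption made here for comparison and is not stated in the text.

```python
import math

# Figures quoted in the text: about 100 Mbps per fiber in the early 1980s,
# about 400 Gbps roughly twenty years later.
rate_start_bps = 100e6
rate_end_bps = 400e9
years = 20.0

growth_factor = rate_end_bps / rate_start_bps        # overall increase
doubling_time = years / math.log2(growth_factor)     # years per doubling

# ASSUMPTION (not from the text): Moore's law as one doubling every 2 years.
moore_doubling_years = 2.0
moore_factor = 2.0 ** (years / moore_doubling_years)

print(f"fiber growth factor: {growth_factor:.0f}")
print(f"implied fiber doubling time: {doubling_time:.2f} years")
print(f"Moore's-law factor over {years:.0f} years: {moore_factor:.0f}")
```

The comparison depends on the doubling period that is assumed; a shorter period such as 18 months gives a larger Moore's-law factor over the same span.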

Packet switched networks so far have achieved great success for transferring 
data in an efficient and economical manner. Data communications have
enabled us to acquire timely information from virtually every corner of the world.
Intuition may suggest that the faster a network is, the more favorable it
becomes. But there is little perceived benefit in paying more for yet
another quantum leap to even faster networks. We believe that it is more
imperative and more urgent to introduce innovative communication services
that keep up with the aforementioned four developments in communication
technologies. Multimedia communications for telecollaboration (for example,
teleconferencing, distance learning, and telemedicine) over packet switched
networks is one of the most promising choices. The features that it introduces will
profoundly enhance people’s lives in the way they communicate, and will
bring remarkable value to service providers, enterprises, and end-users.

For collaboration, full-scale interaction and a sense of immersion are essential
to put the users in control and to attain high collaborative productivity
in spite of long distances. In this case, not only must messages be exchanged,
but experiences (sensory information) must also be shared. Experiences are
inherently composed of a number of different media and advanced multimedia 
technologies are crucial to the successful implementation of a telecollabora- 
tion system. The desire to share experiences has been and will continue to be 
the motivating factor in the development of exciting multimedia technologies. 
Multimedia communication differs from traditional communication modes in
that it is no longer constrained by one given medium. It selects appropriate 
media according to the content and combines messages and experiences to- 
gether. With enriched experiences, remote environments can be reproduced 
as faithfully as possible so that local users can make full use of both their 
binaural hearing and binocular vision. Such an immersive interface makes it 
easier to determine who is talking and helps understand better what is being 
discussed, particularly when there are multiple participants. Full-scale inter- 
action differentiates collaboration from exhibition although both are possibly 
powered by multimedia. Interaction establishes two channels of information 
flow from and to a user, which makes communication more effective. This 
can be well recognized by considering the effectiveness of a lecture with and 
without allowing the audience to raise questions. Full-scale interaction and a 
sense of immersion are indeed the two most important features of collabora- 
tion, and we cannot afford to intentionally sacrifice them anymore in building 
next-generation communication systems. 

Evidently, the most powerful way to conduct full-scale interaction and to
create a sense of immersion in telecollaboration is with both visual and audio
media properly involved. But due to space limitations and the authors’ expertise, this
book will focus exclusively on the processing, transmission, and presentation 
of audio and acoustic signals in multimedia communications for telecollabo- 
ration. The ideal acoustic environment that we are pursuing is referred to as 
immersive acoustics, which demands at least full-duplex, hands-free, and spa- 
tial perceptibility. As a result, we confront remarkable challenges to address 
a number of complicated signal processing problems, but at the same time 
possess tremendous opportunities to develop more practically useful and more 
computationally efficient algorithms. These challenges and opportunities will 
be detailed in the following section. 

2. CHALLENGES AND OPPORTUNITIES 

Prior to the Internet, voice communications were carried over the public
switched telephone network (PSTN). Over the duration of a call, there exists
a physical connection between the two users. In this arrangement, the earpiece
at one end is used as the speaker’s extended mouth or articulator at the other
end. This mode of conversation allows full-duplex and even hands-free
communication with the help of a speakerphone. But because of the use of only one
microphone and one loudspeaker, a sense of spatialization cannot be rendered, 
and the listener is unable to obtain a vivid impression of the remote speaking 
environment. Adding video might help, but the hearing experience is still not 
enjoyable for sure. Multiple microphones and loudspeakers must be employed 
to precisely record and faithfully reproduce the remote acoustic environment. 
The television industry has witnessed a successful evolution of audio technology
from now obsolete mono to prevailing stereo, and on to the highly desirable
home theater with 5.1 channels. Therefore, it does not seem to be an exagger- 
ation to believe that the multichannel mode will be the eventual technique of 
choice in multimedia communication systems. The transmission of real-time 
multichannel audio signals (possibly video signals as well) definitely consumes 
a larger bandwidth than before and the limited bandwidth of a traditional tele- 
phone connection will prevent us from implementing advanced multichannel 
communication concepts. In contrast, packet switched networks are flexible in 
allocating bandwidth for a particular service. With the increasing improvement 
of Quality-of-Service (QoS), packet switched networks will provide the needed 
physical connections for multimedia communications. Figure 1.1 depicts the 
differences between the traditional voice communication system over a circuit 
switched telephony network and the new multimedia communication system 
for telecollaboration over a packet switched network. 

At the receiving room of next-generation multimedia communication sys- 
tems, we aim at constructing a spatially sensible sound stage using multiple 
loudspeakers with object-oriented multiple-participant management. A key 
technical challenge at the transmitting room would be our ability to acquire 
high-fidelity speech while keeping speakers’ spatial information with multiple 
microphones. In this case, we work with a complicated multiple-input multiple- 
output (MIMO) system. Consequently, a number of signal processing problems 
need to be addressed in the following broad areas: speech acquisition and en- 
hancement; acoustic echo cancellation; sound source localization and tracking; 
source separation; audio coding; and realistic sound stage reproduction. In this 
book, we invited well-recognized experts to contribute chapters covering the 
state-of-the-art in the research of these focused fields. 

3. ORGANIZATION OF THE BOOK 

The main body of this book is organized into four parts, each of which 
is composed of three chapters. Part I is devoted to the speech acquisition 
and enhancement problem. Part II provides a detailed exposition of theory 
and algorithms for solving the multichannel echo cancellation problem and 
presents a successfully implemented real-time system. Part III concerns the 
source localization/tracking problem and presents a unified treatment for blind 
source separation of convolutive mixtures. Part IV explores audio coding and 
realistic sound stage reproduction. 

Chapter 2 by Elko contains an updated version of a chapter that appeared
in the earlier book, Acoustic Signal Processing for Telecommunication, edited
by Gay and Benesty. This new version combines the development of optimal
arrays for spherical and cylindrical noise cases in a more seamless exposition.




Figure 1.1 Illustration of the difference between (a) the traditional voice communication system
over a circuit switched telephony network and (b) the new multimedia communication system
for telecollaboration over a packet switched network. (Diagram not reproduced; labels recoverable
from panel (b): Multichannel Aggregate Processor, Packet Network, storage/retrieval, intelligent assistance.)



The chapter covers the design and implementation of differential arrays that are 
by definition superdirectional in nature. Aside from their small size, one of the 
most beneficial features of differential arrays is the inherent independence of 
their directional response as a function of frequency. This quality makes them 
very desirable for the acoustic pickup of speech and music signals. The chapter 
covers several optimal differential arrays that are useful for teleconferencing 
and speech pickup in noisy and reverberant environments. A key issue in the 
design of differential arrays is their inherent sensitivity to noise and sensor 
mismatch. This important issue is covered in the chapter in a general way to 
enable one to assess the robustness of any differential array design. 










Chapter 3 by Meyer and Elko describes a new spherical microphone array 
that enables accurate capture of the spatial components of a sound field. The ar- 
ray performs an orthonormal spatial decomposition of the sound pressure field 
using spherical harmonics. Sufficient order decomposition into these orthogo- 
nal spherical harmonic spatial modes (called eigenbeams) allows one to realize 
much higher spatial resolution than traditional recording systems, thereby en- 
abling more accurate sound field capture. A general mathematical framework 
is given in the chapter where it is shown that these eigenbeams form the basis 
of a scalable representation that enables one to compute and analyze the spatial 
distribution of live or recorded sound fields in a computationally very efficient 
manner. Experimental results from a real-time implementation show that the
theory based on spherical harmonic eigenbeams matches the measured data.

In many telecommunications applications, speech communications are de- 
graded by the presence of background noise. As discussed in Chapter 4 by 
Diethorn, digital signal processing can be used to reduce the level of the noise
for the purpose of enhancing the quality of transmitted speech. There are two 
main categories of approaches to noise reduction: spatial-acoustic processing 
methods (e.g., beamforming) and non-spatial noise suppression, or noise strip- 
ping, methods. The former category is discussed in Chapter 2 of this text. The 
latter category includes Wiener filtering and short-time spectral modification 
techniques that attempt to increase the speech-signal-to-noise ratio from knowl- 
edge of the noisy speech signal alone. Chapter 4 reviews the most popular of 
these methods, provides some perspective on their origin, and demonstrates a 
noise suppression method using actual recorded speech. 

Adaptive algorithms play an important role in audio signal processing. 
Whenever we need to estimate and track an acoustic channel with or without a 
reference (input) signal, adaptive filtering is the best tool to use. In Chapter 5 by 
Benesty, Gansler, Huang, and Rupp, a large number of multichannel adaptive 
algorithms, both in the time and frequency domains, are discussed. This dis- 
cussion is developed in the context of multichannel acoustic echo cancellation 
where we have to identify a multiple-input multiple-output (MIMO) system 
(e.g., room acoustic impulse responses). 
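
To make the idea of estimating a channel from a reference signal concrete, the sketch below identifies a short impulse response with a single-channel normalized LMS (NLMS) filter. This is a deliberately simplified stand-in and not one of the multichannel algorithms of Chapter 5; the function name `nlms_identify`, the step size, the four-tap "echo path" and the white-noise input are all assumptions chosen for illustration.

```python
import random

def nlms_identify(x, d, num_taps, mu=0.5, eps=1e-8):
    """Estimate an FIR channel from reference x and observation d using NLMS."""
    w = [0.0] * num_taps       # current channel estimate
    buf = [0.0] * num_taps     # most recent reference samples, newest first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(wi * bi for wi, bi in zip(w, buf))   # filter output
        e = dn - y                                   # a priori error
        norm = sum(b * b for b in buf) + eps         # input energy (regularized)
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
    return w

# Hypothetical noise-free test: white-noise reference through a known 4-tap path.
random.seed(1)
true_h = [0.8, -0.4, 0.2, 0.1]
x = [random.gauss(0.0, 1.0) for _ in range(2000)]
d = [sum(true_h[k] * x[n - k] for k in range(len(true_h)) if n - k >= 0)
     for n in range(len(x))]

w_est = nlms_identify(x, d, num_taps=len(true_h))
print([round(w, 4) for w in w_est])
```

With a white reference and no noise the estimate settles on the true taps; real acoustic paths are far longer and the input is speech, which is where the more elaborate algorithms become necessary.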

Double-talk detectors (DTDs) are vital to the operation and performance of 
acoustic echo cancelers. In Chapter 6 by Gansler and Benesty, important aspects 
that should be considered when designing a DTD are discussed. The generic 
double-talk detector scheme and fundamental means for performance evaluation 
are discussed. A number of double-talk detectors suitable for acoustic echo 
cancelers are presented and objectively compared using their respective receiver 
operating characteristic. 

In Chapter 7 by Gansler, Fischer, Diethorn, and Benesty, the design of a
real-time software acoustic echo canceler running natively under the Windows
operating systems on personal computers is presented. With this software,
teleconferencing is possible in wideband stereo audio over commercial IP net- 
works in point-to-point as well as multi-point communication scenarios. The 
main challenge for such an implementation is to achieve sample-synchronized 
input and output streams for audio. This is required by the echo cancellation 
algorithm to maintain stable performance. Methods that achieve stable perfor- 
mance on hardware from various manufacturers are described. Furthermore, 
stereophonic echo cancellation is significantly more complicated to handle than 
the monophonic case because of computational complexity, nonuniqueness of 
solution, and convergence problems. In this chapter, the core algorithms are 
described including a powerful double-talk detection unit, a fast frequency- 
domain RLS adaptation algorithm which is optimized for non-Gaussian noise 
distributions, and residual echo and noise suppression. Simulation results are 
given which show that the algorithms achieve the theoretical bound on per- 
formance (echo attenuation). Furthermore, the software has been used for 
teleconferencing in wideband stereo audio over commercial IP networks. 

Chapter 8 by Chen, Huang, and Benesty focuses on time delay estimation 
(TDE) in a reverberant environment. Particular attention is paid to the robustness
of a TDE system with respect to multipath and reverberation effects. Three
broad categories of approaches are studied: generalized cross-correlation, mul- 
tichannel cross-correlation, and blind channel identification. The strengths and 
weaknesses of these approaches are elaborated and their performance (with 
respect to noise and reverberation) is studied and compared. 
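
As a minimal illustration of the cross-correlation idea behind the first category, the sketch below estimates the delay between two microphone signals by locating the peak of their plain cross-correlation. This is the unweighted special case only, without the weighting functions that generalized cross-correlation methods add; it ignores reverberation, and the signal model, noise level and function name are assumptions made for this example.

```python
import math
import random

def cross_correlation_delay(x1, x2, max_lag):
    """Return the lag (in samples) by which x2 lags x1, found by maximizing
    the unweighted cross-correlation over |lag| <= max_lag."""
    best_lag, best_val = 0, -math.inf
    n = len(x1)
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += x1[i] * x2[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    return best_lag

# Hypothetical test signals: white-noise source, second microphone delayed
# by 7 samples with a little independent sensor noise added.
random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 7
mic1 = source
mic2 = [0.0] * true_delay + source[:-true_delay]
mic2 = [s + 0.1 * random.gauss(0.0, 1.0) for s in mic2]

estimated_delay = cross_correlation_delay(mic1, mic2, max_lag=20)
print("estimated delay (samples):", estimated_delay)
```

In a reverberant room the correlation function has many competing peaks, which is the robustness problem this chapter addresses.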

Chapter 9 by Huang, Benesty, and Elko provides an overview of fundamental 
concepts and a number of cutting-edge algorithms for acoustic source localiza- 
tion with passive microphone arrays. The localization problem is postulated 
from the perspective of estimation theory and the Cramer-Rao lower bound for 
unbiased location estimators is derived. After an insightful review of conven- 
tional approaches ranging from maximum likelihood to least squares estimators, 
a recently developed linear-correction least-squares algorithm is presented that
is more robust to measurement errors and more efficient both computationally
and statistically. At the end of Chapter 9, the design and implementation
of a successful real-time acoustic source localization/tracking system for video 
camera steering in teleconferencing is described. 

Chapter 10 by Buchner, Aichner, and Kellermann presents a unified treat- 
ment of blind source separation algorithms for convolutive mixtures as arising 
from reverberant acoustic environments. Based on an information-theoretical 
approach, time-domain and frequency-domain algorithms are described, ex- 
ploiting three fundamental signal properties, as is typical, e.g., for speech and 
audio signals: nonwhiteness, nonstationarity, and non-Gaussianity. This framework
covers existing and novel algorithms and provides both new theoretical
insights and efficient practical realizations. 







Chapter 11 by Schuler describes the basics of perceptual audio coding and
recent developments in audio coding targeted for communications applications.
It includes the fundamentals of psycho-acoustic models as they are used
in today’s audio coders and the principles of filter bank structures and design. 
Further, it describes the structure and function of standard audio coders, and 
explains why this structure leads to too much end-to-end delay for communi- 
cations applications. Low delay audio coders and a new structure for audio 
coding for communications applications are then presented. 

Conventional approaches to reproducing a spatially sensible sound stage are 
not capable of immersing a large number of listeners. Chapter 12 by Spors, 
Teutsch, Kuntz, and Rabenstein describes a novel technique called wave field 
synthesis, which is developed to overcome this problem. It is based on the 
principles of wave physics and is suitable for implementation with current mul- 
tichannel audio hardware and software products. The listeners are not restricted 
in number, position, or activity and are not required to wear headphones. Fur- 
thermore, the listening area can be of arbitrary size. This chapter also addresses 
the problem of spatial aliasing and the compensation of non-ideal properties of 
loudspeakers and listening rooms, which is important for a successful imple- 
mentation of wave field synthesis systems. 

In Chapter 13 by Avendano, techniques based on binaural signal processing 
are presented that are capable of encoding and rendering sound sources accu- 
rately in three-dimensional space. In telecollaboration environments, where 
realistic sound-stage reproduction is a requirement for immersion, these sys- 
tems offer multiple advantages. One of these is the reduced number of playback 
channels necessary to render the sound stage. Given that the human hearing 
mechanism is binaural, all spatial information available to the listener is en- 
coded within two acoustic signals reaching the ears; thus in principle only two 
playback channels are necessary and sufficient to realistically render spatial 
sound. The decoding mechanisms that humans use to process spatial sound
and the acoustic mechanisms that encode this information are described. With
this knowledge, the design of virtual spatial sound (VSS) systems with appli- 
cations to telecollaboration is discussed. 
Abstract Noise and reverberation can seriously degrade both microphone reception and
loudspeaker transmission of audio signals in telecommunication systems. Direc- 
tional loudspeakers and microphone arrays can be effective in combating these 
problems. This chapter covers the design and implementation of differential 
arrays that are by definition small compared to the acoustic wavelength. Differ- 
ential arrays are therefore superdirectional arrays since their directivity is higher 
than that of a uniformly summed array with the same geometry. Aside from 
their small size, another beneficial feature of differential arrays is the inherent 
independence of their directional response as a function of frequency. Deriva- 
tions are included for several optimal differential arrays that may be useful for 
teleconferencing and speech pickup in noisy and reverberant environments. Ex- 
pressions and design details covering optimal multiple-order differential arrays 
are given. The results shown in this chapter should be useful in designing and 
selecting directional microphones for a variety of applications. 

Keywords: Acoustic Arrays, Beamforming, Directional Microphones, Differential Microphones, Room Acoustics



1. INTRODUCTION 

Acoustic noise and reverberation can seriously degrade both the microphone 
reception and the loudspeaker transmission of speech signals in communica- 
tion systems. The use of small directional microphones and loudspeakers can 
be effective in combatting these problems. First-order differential microphones 
have been in existence for more than 50 years. Due to their directional (farfield) 
and close-talking properties (nearfield), they have proven essential for the re- 
duction of feedback in public address systems and for communication systems 
in high noise and reverberant environments. In telephone applications, such as 
speakerphone teleconferencing, directional microphones can be very effective 
at reducing noise and reverberation. Since small differential arrays can offer
significant improvement in teleconferencing configurations, it is expected that 
they will become standard components for audio communication devices in the 
future. 

Work on various fixed and adaptive differential microphone arrays was 
started at the Acoustics Research Department at Bell Labs in the late 1980s.
The contents of this chapter represent some of the fundamental work that was 
done at Bell Labs. The main focus of this chapter is to show the development of 
some of the necessary analytical expressions in differential microphone array 
design. Included are several results of potential interest to a designer of such
microphone systems. Various design configurations of multiple-order differential
arrays optimized under various criteria are discussed. 

Generally, designs and applications of differential microphones are illus- 
trated. Since transduction and transmission of acoustic waves are commonly 
reciprocal processes, the results presented here are also applicable to loudspeak- 
ers. However, differential loudspeaker implementations are problematic since 
large volume velocities are required to radiate sufficient acoustic power. The 
reasons are twofold: first, the source must be small compared to the acous- 
tic wavelength; second, the real-part of the radiation impedance becomes very 
small for differential operation. An additional factor that must be carefully
accounted for in differential loudspeaker array design is the mutual radiation
impedance between array elements. Treatment of these additional factors
introduces analytical complexity that would significantly lengthen the exposition
presented here. Since the goal is to introduce the essential concepts of
differential arrays, the chapter focuses exclusively on the microphone array design
problem.

2. DIFFERENTIAL MICROPHONE ARRAYS 

The term first-order differential array applies to any array whose response 
is proportional to the combination of two components: a zero-order (acoustic 
pressure) signal and another proportional to the first-order spatial derivative of a 
scalar acoustic pressure field. Similarly, the term $n$th-order differential array is
used for arrays that have a response proportional to a linear combination of signals
derived from spatial derivatives up to and including order $n$. Differential arrays have
higher directivity than that of a uniformly weighted delay-sum array having the 
same array geometry. Arrays that have this behavior are sometimes referred 
to as superdirectional arrays. Microphone array systems that are discussed in 
this chapter respond to finite-differences of the acoustic pressure that closely 
approximate the pressure differentials for general order. Thus, the interelement 
spacing of the array microphones is much smaller than the acoustic wavelength 
and the arrays are therefore inherently superdirectional. Typically, differential 
arrays combine the outputs of closely-spaced microphones in an alternating sign
fashion. Thus, differential arrays are also occasionally referred to as pattern- 
differencing arrays. 

Before discussing various implementations of $n$th-order finite-difference
systems, expressions are developed for the $n$th-order spatial acoustic pressure
derivative in a direction $\mathbf{r}$ (the bold type indicates a vector quantity). Since
realizable differential arrays are approximations to acoustic pressure differen- 
tials, equations for general order differentials provide significant insight into 
the operation of these systems. 

The acoustic pressure field for a propagating acoustic plane-wave can be 
written as 

$$p(\mathbf{k}, \mathbf{r}, t) = P_o e^{j(\omega t - \mathbf{k}^T \mathbf{r})} = P_o e^{j(\omega t - kr\cos\theta)}, \tag{2.1}$$

where $P_o$ is the plane-wave amplitude, $\omega$ is the angular frequency, $T$ is the
transpose operator, $\mathbf{k}$ is the acoustic wavevector ($\|\mathbf{k}\| = k = \omega/c = 2\pi/\lambda$, and
$\lambda$ is the acoustic wavelength), $c$ is the speed of sound, and $r = \|\mathbf{r}\|$, where $\mathbf{r}$ is the
position vector relative to the selected origin. The angle $\theta$ is the angle between
the position vector $\mathbf{r}$ and the wavevector $\mathbf{k}$. Dropping the time dependence and
taking the $n$th-order spatial derivative along the direction of the position vector
$\mathbf{r}$ yields:

$$\frac{d^n}{dr^n} p(k, r) = P_o (-jk\cos\theta)^n e^{-jkr\cos\theta}. \tag{2.2}$$

The plane-wave solution is valid for the response to sources that are “far” from
the microphone array. The term “far” implies that the distance between source
and receiver is many times the square of the relevant source dimension divided
by the acoustic wavelength. Using (2.2) one can conclude that the $n^{th}$-order
differential has a bidirectional pattern component with the shape of $(\cos\theta)^n$.
It can also be seen that the frequency response of a differential microphone
is high-pass with a slope of $6n$ dB per octave. If the far-field assumption is
relaxed, the response of the differential system to a point source located at the
coordinate origin is

$$p(k, r) = P_o \frac{e^{-j(kr\cos\theta)}}{r}. \tag{2.3}$$

The $n^{th}$-order spatial derivative in the radial direction $r$ is

$$\frac{d^n}{dr^n} p(k, r, \theta) = P_o \frac{n!\, e^{-jkr\cos\theta}}{r^{n+1}} (-1)^n \sum_{m=0}^{n} \frac{(jkr\cos\theta)^m}{m!}, \tag{2.4}$$

where $r$ is the distance to the source. A fundamental property for differential
arrays is that the general $n^{th}$-order array response is a weighted sum of
bidirectional terms of the form $\cos^n\theta$. This property will be used in later
sections. First, though, the effects of the finite-difference approximation for
the spatial derivatives are investigated.







If the first-order derivative of the acoustic pressure field is expanded into
a spatial Taylor series, then zero and first-order terms are required to express
the general spatial derivative. The resulting equations are nothing other than
finite-difference approximations to spatial derivatives. As long as the element
spacing is small compared to the acoustic wavelength, the higher-order terms
(namely, the higher-order derivatives) become insignificant over a desired
frequency range. The resulting approximation can be expressed as the exact
spatial derivative multiplied by a bias error term. For the first-order case and
for a plane-wave acoustic field (2.2), the pressure derivative is

$$\frac{\partial p(k, r, \theta)}{\partial r} = -jk P_o \cos\theta\, e^{-jkr\cos\theta}. \tag{2.5}$$

A finite-difference approximation for the first-order system can be defined as

$$\frac{\Delta p(k, r, \theta)}{\Delta r} \equiv \frac{p(k, r + d/2, \theta) - p(k, r - d/2, \theta)}{d} = \frac{-j2P_o \sin(kd/2\, \cos\theta)\, e^{-jkr\cos\theta}}{d}, \tag{2.6}$$

where $d$ is the distance between the two microphones. If we now define the
amplitude bias error $\epsilon_1$ as

$$\epsilon_1 = \frac{\Delta p/\Delta r}{\partial p/\partial r}, \tag{2.7}$$

then on-axis ($\theta = 0$),

$$\epsilon_1 = \frac{\sin(kd/2)}{kd/2} = \frac{\sin(\pi d/\lambda)}{\pi d/\lambda}. \tag{2.8}$$



Figure 2.1 shows the amplitude bias error $\epsilon_1$ between the true pressure
differential and the approximation by the difference between two closely-spaced
omnidirectional (zero-order) microphones. The bias error is plotted as a
nondimensional function of microphone spacing divided by the acoustic wavelength
($d/\lambda$). From Fig. 2.1, it can be seen that the element spacing must be less than
1/4 of the acoustic wavelength for the error to be less than 1 dB. Similar
equations can be written for higher-order differences. The relative approximation
error for these systems is low-pass in nature if the frequency range is limited to
the small $kd$ range.
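The quarter-wavelength rule of thumb can be checked directly from (2.8). The following is a minimal sketch using only the Python standard library; the function name `bias_error_db` is an illustrative choice and not part of the original text.

```python
import math

def bias_error_db(d_over_lambda):
    """On-axis finite-difference amplitude bias error (2.8), in dB."""
    x = math.pi * d_over_lambda              # x = kd/2 = pi * d / lambda
    eps1 = 1.0 if x == 0.0 else math.sin(x) / x
    return 20.0 * math.log10(abs(eps1))

if __name__ == "__main__":
    for ratio in (0.05, 0.10, 0.25, 0.50):
        print(f"d/lambda = {ratio:.2f}: bias error = {bias_error_db(ratio):.2f} dB")
```

At $d/\lambda = 1/4$ the sketch evaluates to roughly 0.9 dB of error, consistent with the 1 dB bound read from Fig. 2.1.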

In general, to realize an array that is sensitive to the $n^{th}$ derivative of the
incident acoustic pressure field, we require $m$ $p^{th}$-order microphones, where
$m + p - 1 = n$. For example, a first-order differential microphone requires
two zero-order microphones.

Figure 2.1 Relative finite-difference amplitude error in dB for a plane-wave propagating along
the microphone axis, as a function of element spacing divided by the acoustic wavelength.

A first-order differential microphone is typically fabricated with a single
movable membrane that is open to the sound field on both sides. Since the
microphone responds to the pressure-difference across the membrane, a simple
relationship to the acoustic particle velocity can be obtained from the linearized
Euler equation for an ideal (no viscosity) fluid. Euler’s equation can be written
as

$$-\nabla p = \rho \frac{\partial \mathbf{v}}{\partial t}, \tag{2.9}$$

where $\rho$ is the fluid density and $\mathbf{v}$ is the acoustic particle velocity. The time
derivative of the particle velocity is proportional to the pressure-gradient. For
an axial component of the velocity vector, the output is proportional to the
pressure differential along that axis. Thus, a first-order differential microphone
is one that responds to both the scalar pressure and a component of the particle
velocity of the sound field at the same measurement position. Higher-order
microphones can be formed by combinations of lower-order microphones where
the sum of all of the component microphone orders is equal to the desired
differential microphone order.

It is instructive to analyze a first-order differential microphone to establish
the basic concepts required in building and analyzing higher-order differential
arrays. To begin, let us examine the simple two-element differential array as
shown in Fig. 2.2. For a plane-wave with amplitude $P_o$ and wavenumber $k$
incident on a two-element array, the output can be written as

$$E_1(k, \theta) = P_o \left(1 - e^{-jkd\cos\theta}\right), \tag{2.10}$$

where $d$ is the interelement spacing and the subscript indicates a first-order
differential array. Note again that the explicit time dependence factor is neglected
for the sake of compactness. If it is now assumed that the spacing is much
smaller than the acoustic wavelength, then

$$E_1(k, \theta) \approx P_o kd \cos\theta. \tag{2.11}$$

As expected, the first-order array has the factor $\cos\theta$ that resolves the component
of the acoustic particle velocity along the microphone axis.

Figure 2.2 Diagram of first-order microphone composed of two zero-order (omnidirectional)
microphones.

By allowing the addition of time delay between these two subtracted zero-order
microphones, it is possible to realize general first-order directional responses.
For a plane-wave incident on this new array

$$E_1(\omega, \theta) = P_o \left(1 - e^{-j\omega(\tau + d\cos\theta/c)}\right), \tag{2.12}$$

where $\tau$ is equal to the delay applied to the signal from one microphone. For
small spacing ($kd \ll \pi$ and $\omega\tau \ll \pi$),

$$E_1(\omega, \theta) \approx P_o \omega \left(\tau + d/c\, \cos\theta\right). \tag{2.13}$$

One thing to notice about (2.13) is that the first-order array has a high-pass
frequency dependence. The term in the parentheses in (2.13) contains the array
directional response. In the design of differential arrays, the array directivity
function is the quantity that is of interest. To further simplify the analysis for
the directivity of the first-order array, define $a_0$, $a_1$, and $\alpha_1$ such that

$$\alpha_1 = a_0 = \frac{\tau}{\tau + d/c} \tag{2.14}$$

and

$$1 - \alpha_1 = a_1 = \frac{d/c}{\tau + d/c}. \tag{2.15}$$

From these definitions it can easily be seen that

$$a_0 + a_1 = 1. \tag{2.16}$$

Thus, a normalized directional response can be written as

$$E_{N_1}(\theta) = a_0 + a_1 \cos\theta = \alpha_1 + (1 - \alpha_1) \cos\theta, \tag{2.17}$$

where the subscript $N$ denotes the normalized response of a first-order system,
i.e., $E_{N_1}(0) = 1$. The normalization of the array response has effectively
factored out the term that defines the directional response of the microphone array.
The most interesting thing to notice in (2.17) is that the first-order differential
array directivity function is independent of frequency within the region where
the assumption of small spacing compared to the acoustic wavelength holds.
Note that the dependent variable $\alpha_1$ is itself a function of the variables $d$ and $\tau$.
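The frequency independence claimed for (2.17) can be illustrated by comparing the exact two-element response (2.12), normalized to its on-axis value, with the small-spacing form. This is a minimal sketch under stated assumptions: the speed of sound (343 m/s), the 1 cm spacing, and all function names are illustrative choices, not values from the text.

```python
import cmath
import math

C = 343.0  # assumed speed of sound in m/s

def first_order_exact(freq, theta, d, tau):
    """|E_1| from (2.12) with P_o = 1."""
    omega = 2.0 * math.pi * freq
    return abs(1.0 - cmath.exp(-1j * omega * (tau + d * math.cos(theta) / C)))

def normalized_exact(freq, theta, d, tau):
    """Exact response normalized to its value at theta = 0."""
    return first_order_exact(freq, theta, d, tau) / first_order_exact(freq, 0.0, d, tau)

def normalized_ideal(theta, d, tau):
    """alpha_1 + (1 - alpha_1) cos(theta), from (2.14) and (2.17)."""
    alpha1 = tau / (tau + d / C)
    return alpha1 + (1.0 - alpha1) * math.cos(theta)

if __name__ == "__main__":
    d = 0.01          # 1 cm spacing (illustrative)
    tau = d / C       # tau = d/c gives alpha_1 = 1/2
    theta = math.radians(60.0)
    for freq in (250.0, 1000.0, 4000.0):
        print(freq, normalized_exact(freq, theta, d, tau), normalized_ideal(theta, d, tau))
```

With these assumed values the two curves agree closely at low frequencies and drift apart as $kd$ grows, which is the region where the small-spacing assumption no longer holds.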

The magnitude of (2.17) is a parametric expression for the “limaçon of Pascal”
algebraic curve. The two terms in (2.17) can be seen to be the sum of a
zero-order microphone (first term) and a first-order microphone (second term),
which is the general form of the first-order array. Early unidirectional
microphones were actually constructed by summing the outputs of a pressure
microphone and a velocity ribbon microphone (pressure-differential microphone)
[12]. One implicit property of (2.17) is that for $0 < \alpha_1 < 1$, there is a
maximum at $\theta = 0$ and a minimum at an angle between $\pi/2$ and $\pi$. For values of
$\alpha_1 > 1/2$ the response has a minimum at 180°, although there is no null in
the response. An example of the response for this case is shown in Fig. 2.3(a)
and designs of this type are sometimes referred to as “subcardioid.” When
$\alpha_1 = 1/2$, the parametric algebraic equation has a specific form which is called
a cardioid. The cardioid pattern has a null in the response at $\theta = 180°$. For
values of $0 \le \alpha_1 < 1/2$ there is a null at $90° \le \theta < 180°$. Figure 2.3(b)
shows a directivity response corresponding to the case where $\alpha_1 = 0.2$. For
the first-order system, the solitary null is located at

$$\theta_1 = \cos^{-1}\left(-\frac{a_0}{a_1}\right) = \cos^{-1}\left(-\frac{\alpha_1}{1 - \alpha_1}\right). \tag{2.18}$$

Figure 2.3 Directivity plots for first-order arrays (a) $\alpha_1 = 0.55$, (b) $\alpha_1 = 0.20$.




Figure 2.4 Three dimensional representation of directivity in Fig. 2.3(b). Note that the viewing 
angle is from the rear-half plane at an angle of approximately 225° . The viewing angle was chosen 
so that the rear-lobe of the array would not be obscured by the mainlobe. 
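The null location of the first-order pattern in (2.17) and (2.18) can be evaluated for the cases plotted in Fig. 2.3. This is a minimal sketch; the function names are illustrative, and returning `None` for the no-null (subcardioid) case is a convention chosen here, not something the text prescribes.

```python
import math

def first_order_pattern(theta, alpha1):
    """Normalized first-order response (2.17); theta in radians."""
    return alpha1 + (1.0 - alpha1) * math.cos(theta)

def null_angle_deg(alpha1):
    """Null angle from (2.18) in degrees, or None when the pattern has no null."""
    if alpha1 >= 1.0:
        return None                      # omnidirectional limit: no null
    arg = -alpha1 / (1.0 - alpha1)
    if abs(arg) > 1.0:
        return None                      # alpha_1 > 1/2: minimum at 180 deg, but no null
    return math.degrees(math.acos(arg))

if __name__ == "__main__":
    for alpha1 in (0.55, 0.5, 0.2, 0.0):
        print(alpha1, null_angle_deg(alpha1))
```

For $\alpha_1 = 0.2$, the case of Fig. 2.3(b), the null falls at about 104.5°, and for the cardioid ($\alpha_1 = 1/2$) it falls at 180°.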



Directivity patterns shown by Fig. 2.3 are actually a representation of a plane
slice through the center line of the three-dimensional spherical coordinate
directivity plot. Arrays discussed in this chapter are rotationally symmetric around
their axes since a linear array geometry is assumed. Figure 2.4 shows a
three-dimensional representation of the directivity pattern shown in Fig. 2.3(b).

Figure 2.5 Construction of differential arrays as first-order differential combinations up to
third-order.

A general realization of a first-order differential response is accomplished by
adjusting the time delay between the two zero-order microphones that comprise
the first-order system model. From (2.14) and (2.15), the value of $\tau$ determines
the ratio of $a_1/a_0$. The value of $\tau$ is proportional to $d/c$, the propagation time
for an acoustic wave to axially travel between the zero-order microphones. This
interelement propagation time is

$$\tau = \frac{d\, a_0}{c\, a_1} = \frac{d\, \alpha_1}{c\, (1 - \alpha_1)}. \tag{2.19}$$

From (2.19) and (2.18), the pattern zero is at

$$\theta_1 = \cos^{-1}\left(-\frac{c\tau}{d}\right). \tag{2.20}$$

An $n^{th}$-order array can be written as a sum of the $n^{th}$-order spatial derivative
and lower-order terms. An $n^{th}$-order array can also be written as the product
of $n$ first-order response terms as

$$E_n(\omega, \theta) = P_o \prod_{i=1}^{n} \left[1 - e^{-j\omega(\tau_i + d_i/c\, \cos\theta)}\right], \tag{2.21}$$

where the $d_i$ relate to the microphone spacings, and the $\tau_i$ relate to chosen
time delays. There is a design advantage in expressing the array response
in terms of the products of first-order terms: it is now simple to represent
higher-order systems as cascaded systems of lower order. Figure 2.5 shows
how differential arrays can be constructed for up to third-order. Extension
of the design technique to higher orders is straightforward. Values of $\tau_i$ can
be determined by using the relationships developed in (2.14) and (2.19). The
ordering of the $\tau_i$ is not important as long as $\omega\tau_i \ll \pi$.

If again it is assumed that $kd_i \ll \pi$ and $\omega\tau_i \ll \pi$, then (2.21) can be
approximated as

$$E_n(\omega, \theta) \approx P_o \omega^n \prod_{i=1}^{n} \left(\tau_i + d_i/c\, \cos\theta\right). \tag{2.22}$$

Equation (2.22) can be further simplified by making the same substitution as
was done in (2.14) and (2.15) for the arguments in the product term. Setting
$\alpha_i = \tau_i/(\tau_i + d_i/c)$, then

$$E_n(\omega, \theta) \approx P_o \omega^n \prod_{i=1}^{n} \left[\alpha_i + (1 - \alpha_i) \cos\theta\right]. \tag{2.23}$$

If the product in (2.23) is expanded, a power series in $\cos\theta$ can be written for
the response of the $n^{th}$-order array to an incident plane-wave:

$$E_n(\omega, \theta) = P_o A \omega^n \left(a_0 + a_1 \cos\theta + a_2 \cos^2\theta + \ldots + a_n \cos^n\theta\right), \tag{2.24}$$

where the constant $A$ is an overall gain factor and we have suppressed the
explicit dependence of the directivity function $E$ on the variables $d_i$ and $\tau_i$ for
compactness. The only frequency dependent term in (2.24) is $\omega^n$. Thus the
frequency response of an $n^{th}$-order differential array can be easily compensated by
a low-pass filter whose frequency response is proportional to $\omega^{-n}$. By choosing
the structure that places only a delay behind each element in a differential array,
the coefficients in the power series in (2.24) are independent of frequency,
resulting in an array whose beampattern is independent of frequency. To simplify
the following exposition on the directional properties of differential arrays, it is
assumed that the amplitude factor can be neglected. Also, since the directional
pattern described by the power series in $\cos\theta$ can have any general scaling, the
normalized directional response can be written as solely a function of $\theta$,

$$E_{N_n}(\theta) = a_0 + a_1 \cos\theta + a_2 \cos^2\theta + \ldots + a_n \cos^n\theta, \tag{2.25}$$

where the subscript $N$ denotes a normalized response at $\theta = 0°$ [as in (2.17)]
which implies

$$\sum_{j=0}^{n} a_j = 1. \tag{2.26}$$
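The step from the product of first-order terms in (2.23) to the power-series coefficients of (2.25) is a polynomial multiplication in $\cos\theta$. The sketch below performs that expansion with exact rational arithmetic; the function name `series_coefficients` and the example values of $\alpha_i$ are illustrative assumptions.

```python
from fractions import Fraction

def series_coefficients(alphas):
    """Expand prod_i [alpha_i + (1 - alpha_i) x] into [a_0, ..., a_n], with x = cos(theta)."""
    coeffs = [Fraction(1)]
    for alpha in alphas:
        alpha = Fraction(alpha)
        expanded = [Fraction(0)] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            expanded[j] += c * alpha            # constant part of the new factor
            expanded[j + 1] += c * (1 - alpha)  # cos(theta) part of the new factor
        coeffs = expanded
    return coeffs

if __name__ == "__main__":
    a = series_coefficients([Fraction(1, 2), Fraction(1, 2)])  # two cardioid factors
    print(a, sum(a))
```

Because every factor equals one at $\theta = 0$, the coefficients always sum to one, which is the normalization (2.26).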

In general, $n^{th}$-order differential microphones have at most $n$ nulls (zeros).
This follows directly from (2.25) and the fundamental theorem of algebra.
Equation (2.25) can also be written in “canonic” form as the product of
first-order terms

$$E_{N_n}(\theta) = \prod_{i=1}^{n} \left[\alpha_i + (1 - \alpha_i) \cos\theta\right]. \tag{2.27}$$







Note that the frequency dependent variable $\omega$ in (2.25) and (2.27) has been
dropped since it was shown that the frequency response for small interelement
spacing is simply proportional to $\omega^n$. The terms $a_i$ in (2.25) can take on any
desired value by adjusting the delays used in defining the desired differential
microphone array. For the second-order array

$$E_{N_2}(\theta) = a_0 + a_1 \cos\theta + a_2 \cos^2\theta. \tag{2.28}$$

Equation (2.28) can also be factored into two first-order terms and written as

$$E_{N_2}(\theta) = \left[\alpha_1 + (1 - \alpha_1)\cos\theta\right]\left[\alpha_2 + (1 - \alpha_2)\cos\theta\right], \tag{2.29}$$

where

$$a_0 = \alpha_1 \alpha_2, \qquad a_1 = \alpha_1(1 - \alpha_2) + \alpha_2(1 - \alpha_1), \qquad a_2 = (1 - \alpha_1)(1 - \alpha_2), \tag{2.30}$$

or

$$\alpha_1 = a_0 + a_1/2 \pm \sqrt{(a_0 + a_1/2)^2 - a_0}, \qquad \alpha_2 = a_0 + a_1/2 \mp \sqrt{(a_0 + a_1/2)^2 - a_0}. \tag{2.31}$$

As shown, the general form of the second-order system is the sum of
second-order, first-order and zero-order terms. If certain constraints are placed on the
values of $a_0$ and $a_1$, it can be seen that there are two nulls (zeros) in the interval
$0 \le \theta < \pi$. The array response pattern is symmetric about $\theta = 0$. These zeros
can be explicitly found at the angles $\theta_1$ and $\theta_2$:

$$\theta_1 = \cos^{-1}\left(\frac{-\alpha_1}{1 - \alpha_1}\right), \tag{2.32}$$

$$\theta_2 = \cos^{-1}\left(\frac{-\alpha_2}{1 - \alpha_2}\right), \tag{2.33}$$

where now $\alpha_1$ and $\alpha_2$ can take on either positive or negative values. If the
resulting beampattern is constrained to have a maximum at $\theta = 0°$, then the
values of $\alpha_1$ and $\alpha_2$ can only take on certain values; we have ruled out designs
that have a higher sensitivity at any angle other than $\theta = 0°$. An interesting
thing to note is that negative values of $\alpha_1$ or $\alpha_2$ correspond to a null moving
into the front half-plane. Negative values of $\alpha_1$ for the first-order microphone
can be shown to have a rear-lobe sensitivity that exceeds the sensitivity at 0°.
Since (2.29) is the product of two first-order terms, emphasis of the rear-lobe
caused by a negative value of $\alpha_2$ can be counteracted by the zero from the
term containing $\alpha_1$. As a result, a beampattern can be found for the
second-order microphone that has maximum sensitivity at $\theta = 0°$ and a null in the
front-half plane. This result also implies that the beamwidth of a second-order
microphone with a negative value of $\alpha_2$ is narrower than that of the second-order
dipole ($\cos^2\theta$ directional dependence).
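The second-order relations can be exercised numerically: (2.31) recovers the first-order parameters from the series coefficients, and each parameter then gives a null angle. This is a hedged sketch; the function names are illustrative, and the example coefficients $a_0 = -1/6$, $a_1 = 1/3$ are simply one choice with a negative $\alpha_2$.

```python
import math

def alphas_from_series(a0, a1):
    """Invert (2.30) using (2.31); assumes a0 + a1 + a2 = 1 and a real factorization."""
    half_sum = a0 + a1 / 2.0
    disc = half_sum * half_sum - a0
    if disc < 0.0:
        raise ValueError("coefficients do not factor into real first-order terms")
    root = math.sqrt(disc)
    return half_sum + root, half_sum - root

def null_angle_deg(alpha):
    """Null of alpha + (1 - alpha) cos(theta) in degrees, or None if there is no real null."""
    if alpha >= 1.0:
        return None
    arg = -alpha / (1.0 - alpha)
    if abs(arg) > 1.0:
        return None
    return math.degrees(math.acos(arg))

if __name__ == "__main__":
    alpha1, alpha2 = alphas_from_series(-1.0 / 6.0, 1.0 / 3.0)
    print(alpha1, alpha2, null_angle_deg(alpha1), null_angle_deg(alpha2))
```

In this example the negative $\alpha_2$ places one null in the front half-plane (near 73°) while the other stays in the rear half-plane (near 134°), matching the qualitative behavior described above.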

It is straightforward to extend the previous results to the third-order and
higher-order cases. For completeness, the equation governing the directional
characteristics for the third-order array is

$$E_{N_3}(\theta) = a_0 + a_1 \cos\theta + a_2 \cos^2\theta + a_3 \cos^3\theta. \tag{2.34}$$

If certain constraints are placed on the coefficients in (2.34), it can be factored
into three real roots:

$$E_{N_3}(\theta) = \left[\alpha_1 + (1 - \alpha_1)\cos\theta\right]\left[\alpha_2 + (1 - \alpha_2)\cos\theta\right]\left[\alpha_3 + (1 - \alpha_3)\cos\theta\right]. \tag{2.35}$$

Third-order microphones have the possibility of three zeros that can be placed
at desired locations in the directivity pattern. Solving this cubic equation yields
expressions for $\alpha_1$, $\alpha_2$, and $\alpha_3$ in terms of $a_0$, $a_1$, $a_2$, and $a_3$. However, these
expressions are algebraically cumbersome and not repeated here.



3. ARRAY DIRECTIONAL GAIN 



In order to “best” reject the noise in an acoustic field, one needs to
optimize how multiple microphones are linearly combined. Specifically we need
to consider the directional gain, i.e., the gain of the microphone array in a
noise field over that of a simple omnidirectional microphone. A common
quantity used is the directivity factor $Q$, or equivalently, the directivity index $DI$
[$10\log_{10}(Q)$]. The standard definition for the directivity index is computed for
general three-dimensional isotropic sound fields. However, there are possible
conditions in room acoustics where a sound field can be reasonably modeled as
a two-dimensional or cylindrical sound field. To distinguish equations for
spherical and cylindrical fields, a subscript of $C$ is used to denote the cylindrical
two-dimensional case.

A general expression for the directivity factor can be defined as

$$Q(\omega, \theta_0, \phi_0) = \frac{4\pi\, |E(\omega, \theta_0, \phi_0)|^2}{\int_0^{2\pi}\int_0^{\pi} |E(\omega, \theta, \phi)|^2\, u(\omega, \theta, \phi) \sin\theta\, d\theta\, d\phi}, \tag{2.36}$$

where the angles $\theta$ and $\phi$ are the standard spherical coordinate angles, $\theta_0$ and
$\phi_0$ are the angles at which the directivity factor is being measured, $E(\omega, \theta, \phi)$
is the pressure response of the array, and $u(\omega, \theta, \phi)$ is the distribution of the
noise power. The function $u$ is normalized such that

$$\frac{1}{4\pi}\int_0^{2\pi}\int_0^{\pi} u(\omega, \theta, \phi) \sin\theta\, d\theta\, d\phi = 1. \tag{2.37}$$

For a cylindrical sound field, the equation for a general directivity factor can be
written as

$$Q_C(\omega, \phi_0) = \frac{2\pi\, |E(\omega, \phi_0)|^2}{\int_0^{2\pi} |E(\omega, \phi)|^2\, u_C(\omega, \phi)\, d\phi}. \tag{2.38}$$

Similarly, the distribution of the noise power is normalized such that

$$\frac{1}{2\pi}\int_0^{2\pi} u_C(\omega, \phi)\, d\phi = 1. \tag{2.39}$$



In general, the directivity factor $Q$ can be written as the ratio of two Hermitian
quadratic forms [3] as

$$Q = \frac{\mathbf{w}^H \mathbf{A} \mathbf{w}}{\mathbf{w}^H \mathbf{B} \mathbf{w}}, \tag{2.40}$$

where

$$\mathbf{A} = \mathbf{S}_0 \mathbf{S}_0^H, \tag{2.41}$$

$\mathbf{w}$ is the complex weighting applied to the microphones and $H$ represents the
complex conjugate transpose. Elements of the matrix $\mathbf{B}$ are defined as

$$b_{mn} = \frac{1}{4\pi}\int_0^{2\pi}\int_0^{\pi} u(\omega, \theta, \phi) \exp\left[j\mathbf{k}^T(\mathbf{r}_m - \mathbf{r}_n)\right] \sin\theta\, d\theta\, d\phi, \tag{2.42}$$

where $\mathbf{r}$ are the microphone element position vectors. For the cylindrical sound
field case,

$$b_{C_{mn}} = \frac{1}{2\pi}\int_0^{2\pi} u_C(\omega, \phi) \exp\left[j\mathbf{k}^T(\mathbf{r}_m - \mathbf{r}_n)\right] d\phi. \tag{2.43}$$

The elements of the vector $\mathbf{S}_0$ are defined as

$$S_{0_n} = \exp\left(j\mathbf{k}_0^T \mathbf{r}_n\right). \tag{2.44}$$

Note that for clarity the explicit functional dependencies of the above equations
on the angular frequency $\omega$ have been omitted. The solution for the maximum
of $Q$, which is a Rayleigh quotient, is obtained by finding the generalized
eigenvector with the maximum eigenvalue of the homogeneous equation

$$\mathbf{A}\mathbf{w} = \lambda_M \mathbf{B}\mathbf{w}. \tag{2.45}$$

The maximum eigenvalue of (2.45) is given by

$$\lambda_M = \mathbf{S}_0^H \mathbf{B}^{-1} \mathbf{S}_0. \tag{2.46}$$

The corresponding eigenvector contains the weights for combining the elements
to obtain the maximum directional gain

$$\mathbf{w}_{opt} = \mathbf{B}^{-1} \mathbf{S}_0. \tag{2.47}$$

This result states the general principle of “whiten and match,” where the matrix
inverse $\mathbf{B}^{-1}$ is the spatial whitening filter and the steering vector $\mathbf{S}_0$ is the matching
to signals propagating from the source direction. In general, the optimal weights
$\mathbf{w}_{opt}$ are a function of frequency, array geometry, element directivity and the
spatial distribution of the noise field.
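As an illustration of “whiten and match,” the sketch below evaluates (2.46) and (2.47) for a uniform line array of omnidirectional elements steered to end-fire in spherically isotropic noise. It relies on one standard result that is an assumption relative to this text: with $u = 1$, the integral (2.42) reduces to $b_{mn} = \sin(kd_{mn})/(kd_{mn})$, where $d_{mn}$ is the distance between elements $m$ and $n$. The small linear solver and all names are illustrative.

```python
import cmath
import math

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[complex(v) for v in row] + [complex(rhs[i])] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x

def sinc(x):
    return 1.0 if x == 0.0 else math.sin(x) / x

def max_directivity_endfire(num_mics, kd):
    """Return (Q_max, w_opt) from (2.46) and (2.47) for element spacing kd (radians)."""
    phases = [m * kd for m in range(num_mics)]                  # k * r_m along the array axis
    noise = [[sinc(pm - pn) for pn in phases] for pm in phases]  # B for isotropic noise (assumed sinc form)
    steer = [cmath.exp(1j * p) for p in phases]                 # S_0 from (2.44), end-fire
    w_opt = solve(noise, steer)                                 # whiten and match
    q_max = sum(s.conjugate() * w for s, w in zip(steer, w_opt)).real
    return q_max, w_opt

if __name__ == "__main__":
    for n_mics in (2, 3):
        q, _ = max_directivity_endfire(n_mics, kd=0.05)
        print(n_mics, q)
```

For small $kd$ the computed gain for two and three elements comes out close to 4 and 9, respectively. The matrix becomes poorly conditioned as $kd$ shrinks, so very small spacings call for more careful numerics than this sketch provides.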

4. OPTIMAL ARRAYS FOR ISOTROPIC FIELDS 

Acoustic reverberation in rooms has historically been modeled as spherically
isotropic noise. A spherically isotropic noise field can be constructed by
combining uncorrelated noises propagating in all directions with equal power. In
room acoustics this modeled noise field is referred to as a “diffuse” sound field
and has been the model used for many investigations into the statistical
distributions of reverberant sound pressure fields. Although the standard model
for reverberant acoustic fields has been the “diffuse” field model (spherically
isotropic noise), another noise field that is appropriate for room acoustics is
cylindrical noise. In many rooms, where carpet and ceiling tiles are used, the
main surfaces of absorption are the ceiling and the floor. As a result, a
cylindrical noise field model that has the noise propagating in the axial plane may
be more appropriate. Since it is of interest to design microphone systems that
optimally reject reverberant sound fields, the optimization of array gain will be
given for both spherically and cylindrically “diffuse” sound fields.

4.1 MAXIMUM DIRECTIONAL GAIN 

Publications on the maximization of the directional gain for an arbitrary
array have been quite extensive [17, 18, 13, 4, 16, 8]. Uzkov [17] showed
for $N$ uniformly spaced omnidirectional microphones that the directional gain
reaches $N^2$ as the spacing between the elements goes to zero. A maximum
value of directional gain is obtained when a collinear array is used in end-fire
operation. Weston [18] has shown the same result by an alternative method.
Parsons [13] has extended the proof to include nonuniformly spaced arrays
that are much smaller than the acoustic wavelength. A proof given here also
relies on the assumption that the elements are closely-spaced compared to the
acoustic wavelength. The approach taken here is similar to that of Chu [4], Tai
[16], and Harrington [8], who expanded the radiated and received field in terms
of spherical wave functions. Chu did not look explicitly at the limiting case.
Harrington did examine the limiting case of vanishing spacing, but his analysis
involved approximations that are not necessary in the following analysis.

4.1.1 Spherically Isotropic Fields. For a spherically isotropic field and
for omnidirectional microphone elements,

$$u(\omega, \theta, \phi) = 1. \tag{2.48}$$

In general, the directivity of $N$ closely-spaced microphones can be expanded
in terms of spherical wave functions. Now let us express the farfield pressure
response $E(\theta, \phi)$ as a summation of orthogonal Legendre polynomials,

$$E(\theta, \phi) = \sum_{n=0}^{N-1} \sum_{m=0}^{n} h_{nm} P_n^m[\cos(\theta - \theta_z)] \cos m(\phi - \phi_z), \tag{2.49}$$

where the sum has been limited to the number of degrees of freedom in the
$N$-element microphone case, where the $P_n^m$ are the associated Legendre functions,
and $\theta_z$ and $\phi_z$ are possible rotations of the coordinate system. Now define

$$G_{nm}(\theta, \phi) = P_n^m[\cos(\theta - \theta_z)] \cos m(\phi - \phi_z). \tag{2.50}$$

The normalization of the function $G_{nm}$ is

$$N_{nm} = \int_0^{2\pi} \cos^2 m\phi\, d\phi \int_{-1}^{1} [P_n^m(\eta)]^2\, d\eta = \frac{4\pi (n + m)!}{\epsilon_m (2n + 1)(n - m)!}, \tag{2.51}$$

where the function $\epsilon_m$ is defined as

$$\epsilon_m = \begin{cases} 1 & m = 0 \\ 2 & m > 0. \end{cases} \tag{2.52}$$

By using the orthogonal Legendre function expansion, the directivity factor can
be written as

$$Q(\theta_0, \phi_0) = \frac{4\pi \left|\sum_{n=0}^{N-1} \sum_{m=0}^{n} h_{nm} G_{nm}(\theta_0, \phi_0)\right|^2}{\sum_{n=0}^{N-1} \sum_{m=0}^{n} h_{nm}^2 N_{nm}}. \tag{2.53}$$



To find the maximum of (2.53), we set the derivative of $Q$ with respect to $h_{nm}$
for all $n$ and $m$ to zero. The resulting maximum occurs when

$$Q_{max} = \sum_{n=0}^{N-1} \sum_{m=0}^{n} \frac{4\pi\, [G_{nm}(\theta_0, \phi_0)]^2}{N_{nm}} = \sum_{n=0}^{N-1} \sum_{m=0}^{n} \frac{4\pi\, [P_n^m(\cos(\theta_0 - \theta_z)) \cos m(\phi_0 - \phi_z)]^2}{N_{nm}} \le \sum_{n=0}^{N-1} \sum_{m=0}^{n} \frac{4\pi\, [P_n^m(\cos(\theta_0 - \theta_z))]^2}{N_{nm}}. \tag{2.54}$$

The inequality in (2.54) implies that the maximum must occur when $\phi_0 = \phi_z$.
This result can be seen by using the addition theorem of Legendre polynomials,

$$P_n(\cos\psi) = \sum_{m=0}^{n} \epsilon_m \frac{(n - m)!}{(n + m)!} P_n^m(\cos\theta_0) P_n^m(\cos\theta_z) \cos(m[\phi_0 - \phi_z]). \tag{2.55}$$

The angle subtended by two points on a sphere is $\psi$ and is expressed in terms
of spherical coordinates as

$$\cos\psi = \cos\theta_0 \cos\theta_z + \sin\theta_0 \sin\theta_z \cos(\phi_0 - \phi_z). \tag{2.56}$$

Equation (2.55) maximizes for all $n$ when $\psi = 0$. Therefore (2.54) maximizes
when $\theta_0 = \theta_z$. Since

$$P_n^m(1) = \begin{cases} 1 & m = 0 \\ 0 & m > 0, \end{cases} \tag{2.57}$$

(2.54) can be reduced to

$$Q_{max} = \sum_{n=0}^{N-1} (2n + 1) = N^2. \tag{2.58}$$

Thus, the maximum directivity factor $Q$ for $N$ closely-spaced omnidirectional
microphones is $N^2$. A proof of the maximum directivity factor for general
spacing does not appear to be tractable; however, it is believed that the limit
established in (2.58) is true for general microphone spacing.



4.1.2 Cylindrically Isotropic Fields. For a cylindrically isotropic field
we have uncorrelated plane waves arriving with equal probability from any
wave vector direction that lies in the $\phi$ plane. The cylindrical directivity
factor for this field is therefore defined as

$$Q_C(\omega, \phi_0) = \frac{|E(\omega, \phi_0)|^2}{\frac{1}{2\pi}\int_0^{2\pi} |E(\omega, \phi)|^2\, u(\omega, \phi)\, d\phi}. \tag{2.59}$$

Again, the general weighting function $u$ allows for the possibility of a
nonuniform distribution of noise power and microphone element directivity. For
isotropic noise fields and omnidirectional microphones,

$$u(\omega, \phi) = 1. \tag{2.60}$$

Following the development in the last section, we expand the directional
response of $N$ closely-spaced elements in a series of orthogonal cosine functions.
For cylindrical fields we can use the normal expansion in the $\phi$ dimension:

$$E(\phi) = \sum_{m=0}^{N-1} h_m \cos(m[\phi - \phi_z]). \tag{2.61}$$

The normalization of these cosine functions is simply [1]:

$$N_m = \int_0^{2\pi} \cos^2(m\phi)\, d\phi = \frac{2\pi}{\epsilon_m}. \tag{2.62}$$

For cylindrical fields, the directivity factor can therefore be written as

$$Q_C(\phi_0) = \frac{2\pi \left|\sum_{m=0}^{N-1} h_m \cos(m[\phi_0 - \phi_z])\right|^2}{\sum_{m=0}^{N-1} h_m^2 N_m}. \tag{2.63}$$

The maximum is found by equating the derivative of this equation for $Q_C$ with
respect to the $h_m$ weights to zero. The result is

$$Q_{C_{max}}(\phi_0) = \sum_{m=0}^{N-1} \frac{2\pi \cos^2(m[\phi_0 - \phi_z])}{N_m}. \tag{2.64}$$

The equation for $Q_C$ maximizes when $\phi_0 = \phi_z$. Therefore

$$Q_{C_{max}} = \sum_{m=0}^{N-1} \epsilon_m = 2N - 1. \tag{2.65}$$



The above result indicates that the maximum directivity factor of $N$
closely-spaced omnidirectional microphones in a cylindrically correlated sound field is
$2N - 1$. This result is apparently known in the microphone array processing
community [5], but apparently a general proof has not been published.
One obvious conclusion that can be drawn from the above result is that the rate
of increase in directivity factor as a function of the number of microphones is
much slower for a cylindrically isotropic field than a spherically isotropic field.

A plot comparing the maximum gain for microphone arrays containing up to
ten elements for both spherically and cylindrically isotropic fields is shown in
Fig. 2.6. There are two main trends that can easily be seen in Fig. 2.6. First, the
incremental gain in directivity index decreases as the number of elements (order)
is increased. Second, the difference between the maximum gains for spherical and
cylindrical fields is quite sizable and increases as the number of elements is
increased.

Figure 2.6 Maximum gain of an array of N omnidirectional microphones for spherical and
cylindrical isotropic noise fields.
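The two limits, $N^2$ from (2.58) and $2N - 1$ from (2.65), can be tabulated as directivity indices to reproduce the trend shown in Fig. 2.6. This is a minimal sketch; the function name is an illustrative choice.

```python
import math

def max_directivity_index_db(num_mics):
    """Return (spherical DI, cylindrical DI) in dB for N closely-spaced omnidirectional elements."""
    q_spherical = num_mics ** 2          # (2.58)
    q_cylindrical = 2 * num_mics - 1     # (2.65)
    return 10.0 * math.log10(q_spherical), 10.0 * math.log10(q_cylindrical)

if __name__ == "__main__":
    for n_mics in range(1, 11):
        di_s, di_c = max_directivity_index_db(n_mics)
        print(f"N = {n_mics:2d}: spherical {di_s:5.2f} dB, cylindrical {di_c:5.2f} dB")
```

For ten elements the spherical limit is 20 dB while the cylindrical limit is under 13 dB, which quantifies the widening gap described above.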



The first observation is not too problematic since practical differential array
implementations are limited to third-order. The second observation shows that
the attainable gain in cylindrical fields might not provide the microphone
array gain required for the desired rejection of noise and reverberation in rooms
that have low absorption in the axial plane.

4.2 MAXIMUM DIRECTIVITY INDEX FOR
DIFFERENTIAL MICROPHONES

As was shown in Section 2, there are an infinite number of possibilities for
differential array designs. What is of interest here are the special cases that are
optimal in some respect. For microphones that are axisymmetric, as is the case
for all of the microphones covered here, (2.36) can be written in a simpler form:

$$Q(\omega) = \frac{2}{\int_0^{\pi} |E_{N_n}(\omega, \theta, \phi)|^2 \sin\theta\, d\theta}, \tag{2.66}$$

and for the cylindrical field case,

$$Q_C(\omega) = \frac{2\pi}{\int_0^{2\pi} |E_{N_n}(\omega, \phi)|^2\, d\phi}, \tag{2.67}$$

where it has also been assumed that the directions $\theta_0$, $\phi_0$ are in the direction
of maximum sensitivity and that the array sensitivity function is normalized:
$|E_{N_n}(\omega, \theta_0, \phi_0)| = 1$. If we now insert the formula from (2.25) and carry out
the integration, we find the directivity factor expressed in terms of $a_i$ as

$$Q(a_0, \ldots, a_n) = \left[\sum_{i=0}^{n} \sum_{\substack{j=0 \\ i+j\ \text{even}}}^{n} \frac{a_i a_j}{1 + i + j}\right]^{-1}. \tag{2.68}$$

The directivity factor for a general $n^{th}$-order differential array (no normalization
assumption) can be written as

$$Q(a_0, \ldots, a_n) = \left[\sum_{i=0}^{n} a_i\right]^2 \left[\sum_{i=0}^{n} \sum_{\substack{j=0 \\ i+j\ \text{even}}}^{n} \frac{a_i a_j}{1 + i + j}\right]^{-1} = \frac{\mathbf{a}^T \mathbf{B} \mathbf{a}}{\mathbf{a}^T \mathbf{H} \mathbf{a}}; \tag{2.69}$$



similarly, the cylindrical case can be written as

$$Q_C(a_0, \ldots, a_n) = \frac{\mathbf{a}^T \mathbf{B} \mathbf{a}}{\mathbf{a}^T \mathbf{H}_C \mathbf{a}},$$

where $\mathbf{H}$ and $\mathbf{H}_C$ are Hankel matrices given by

$$H_{i,j} = \begin{cases} \dfrac{1}{1 + i + j} & \text{if } i + j \text{ even} \\ 0 & \text{otherwise} \end{cases}$$

and

$$H_{C_{i,j}} = \begin{cases} \dfrac{(i + j - 1)!!}{(i + j)!!} & \text{if } i + j \text{ even} \\ 0 & \text{otherwise.} \end{cases} \tag{2.70}$$

The vector $\mathbf{a}$ is defined as

$$\mathbf{a}^T = \{a_0, a_1, \ldots, a_n\} \tag{2.71}$$

and the matrix $\mathbf{B}$ is

$$\mathbf{B} = \mathbf{b}\mathbf{b}^T, \tag{2.72}$$

where

$$\mathbf{b}^T = \{\underbrace{1, 1, \ldots, 1}_{n+1}\}. \tag{2.73}$$

Table 2.1 Maximum array gain Q and corresponding eigenvector for differential
arrays from first to fourth-order for spherically isotropic noise fields.

order    max eigenvalue    eigenvector
1        4                 [1/4  3/4]
2        9                 [-1/6  1/3  5/6]
3        16                [-3/32  -15/32  15/32  35/32]
4        25                [0.075  -0.300  -1.050  0.700  1.575]



From (2.69) we can see that the directivity factors $Q$ and $Q_C$ are both Rayleigh
quotients for two Hermitian forms. From Section 3, the maximum of the
Rayleigh quotient is reached at a value equal to the largest generalized
eigenvalue of the equivalent generalized eigenvalue problem,

$$\mathbf{B}\mathbf{x} = \lambda \mathbf{H}\mathbf{x}, \tag{2.74}$$

where again, $\lambda$ is the general eigenvalue and $\mathbf{x}$ is the corresponding general
eigenvector. The eigenvector corresponding to the largest eigenvalue will
contain the coefficients $a_i$ which maximize the directivity factor $Q$ or $Q_C$. Since
$\mathbf{B}$ is a dyadic product there is only one eigenvector $\mathbf{x} = \mathbf{H}^{-1}\mathbf{b}$ with the
eigenvalue $\mathbf{b}^T \mathbf{H}^{-1} \mathbf{b}$. Thus

$$\max_{\mathbf{a}} Q = \lambda_M = \mathbf{b}^T \mathbf{H}^{-1} \mathbf{b}, \tag{2.75}$$

and similarly,

$$\max_{\mathbf{a}} Q_C = \lambda_M = \mathbf{b}^T \mathbf{H}_C^{-1} \mathbf{b}. \tag{2.76}$$

Tables 2.1 and 2.2 give the maximum array gain (largest eigenvalue) and
corresponding eigenvectors (values of $a_i$ that maximize the directivity factor), for
differential orders up to fourth-order. Note that the largest eigenvector has been
scaled such that the microphone output is unity at $\theta = 0°$. Figure 2.7 contains
plots of the directivity patterns for the differential arrays given in Table 2.1.
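The closed-form solution (2.75) and (2.76) is easy to evaluate exactly. The sketch below builds the Hankel matrices of (2.70) with rational arithmetic, solves $\mathbf{H}\mathbf{x} = \mathbf{b}$, and scales the eigenvector to unit response at $\theta = 0°$. The solver and all function names are illustrative assumptions rather than anything given in the text.

```python
from fractions import Fraction

def double_factorial(k):
    """k!! with the conventions (-1)!! = 0!! = 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def hankel_spherical(order):
    """H from (2.70) for an order-n array (size n + 1)."""
    size = order + 1
    return [[Fraction(1, 1 + i + j) if (i + j) % 2 == 0 else Fraction(0)
             for j in range(size)] for i in range(size)]

def hankel_cylindrical(order):
    """H_C from (2.70) for an order-n array (size n + 1)."""
    size = order + 1
    return [[Fraction(double_factorial(i + j - 1), double_factorial(i + j))
             if (i + j) % 2 == 0 else Fraction(0)
             for j in range(size)] for i in range(size)]

def solve_exact(matrix, rhs):
    """Gauss-Jordan elimination over the rationals."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = 1 / aug[col][col]
        aug[col] = [v * inv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

def optimum(h_matrix):
    """Return (max Q, coefficients a scaled so that sum(a) = 1)."""
    b = [Fraction(1)] * len(h_matrix)
    x = solve_exact(h_matrix, b)          # x = H^-1 b
    q_max = sum(x)                        # b^T H^-1 b
    return q_max, [v / q_max for v in x]

if __name__ == "__main__":
    for order in range(1, 5):
        print(order, optimum(hankel_spherical(order)), optimum(hankel_cylindrical(order))[0])
```

Run for orders one to four, the spherical case reproduces the eigenvalues and eigenvectors of Table 2.1, and the cylindrical eigenvalues follow $2N - 1$ with $N = n + 1$ elements.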









Figure 2.8 Optimum directivity patterns for differential arrays in a cylindrically isotropic noise 
field for (a) first, (b) second, (c) third, and (d) fourth-order. 



Corresponding plots of the highest array gain directivity patterns for cylindrical 
sound fields in Table 2.2 are shown in Fig. 2.8. 

The directivity index ($10\log_{10} Q$) is an extremely useful measure in
quantifying the directional properties of microphones and loudspeakers. It provides a
rough estimate of the relative gain in signal-to-reverberation for a directional
microphone in a diffuse reverberant environment. However, the directivity index
might be misleading with respect to the performance of directional microphones
in non-diffuse fields.

4.3 MAXIMUM FRONT-TO-BACK RATIO 

Another possible measure for the “merit” of an array is the front-to-back
rejection ratio, i.e., the directional gain of the microphone for signals propagating
to the front of the microphone relative to signals propagating to the rear. One
such quantity was suggested by Marshall and Harry [11] which will be referred
to here as “$F$” for the front-to-back ratio. The ratio $F$ is defined as

$$F(\omega) = \frac{\int_0^{2\pi}\int_0^{\pi/2} |E(\omega, \theta, \phi)|^2 \sin\theta\, d\theta\, d\phi}{\int_0^{2\pi}\int_{\pi/2}^{\pi} |E(\omega, \theta, \phi)|^2 \sin\theta\, d\theta\, d\phi}, \tag{2.77}$$

where the angles $\theta$ and $\phi$ are again the spherical coordinate angles and $E(\omega, \theta, \phi)$
is the far-field pressure response. For a cylindrical field, the front-to-back ratio
can similarly be written as

$$F_C(\omega) = \frac{\int_{-\pi/2}^{\pi/2} |E(\omega, \phi)|^2\, d\phi}{\int_{\pi/2}^{3\pi/2} |E(\omega, \phi)|^2\, d\phi}. \tag{2.78}$$

For axisymmetric microphones (2.77) can be written in a simpler form by
uniform integration over $\phi$,

$$F(\omega) = \frac{\int_0^{\pi/2} |E_N(\omega, \theta, \phi)|^2 \sin\theta\, d\theta}{\int_{\pi/2}^{\pi} |E_N(\omega, \theta, \phi)|^2 \sin\theta\, d\theta}. \tag{2.79}$$


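Carrying out the integration of (2.79) and using the form of (2.25) yields the
front-to-rear power ratio in terms of the weighting vector $\mathbf{a}$,

$$F(a_0,\ldots,a_n) = \frac{\sum_{i=0}^{n}\sum_{j=0}^{n} \dfrac{a_i a_j}{1+i+j}}{\sum_{i=0}^{n}\sum_{j=0}^{n} \dfrac{(-1)^{i+j} a_i a_j}{1+i+j}} = \frac{\mathbf{a}^T\mathbf{B}\,\mathbf{a}}{\mathbf{a}^T\mathbf{H}\,\mathbf{a}}, \tag{2.80}$$

where $\mathbf{H}$ is a Hankel matrix given by

$$H_{i,j} = \frac{(-1)^{i+j}}{1+i+j}. \tag{2.81}$$

$\mathbf{B}$ is a special form of a Hankel matrix designated as a Hilbert matrix
and is given by

$$B_{i,j} = \frac{1}{1+i+j}. \tag{2.82}$$
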
Similarly, for the cylindrical case,

$$F_c = \frac{\mathbf{a}^T\mathbf{B}_c\,\mathbf{a}}{\mathbf{a}^T\mathbf{H}_c\,\mathbf{a}},$$

where

$$B_{c\,i,j} = \frac{\Gamma\!\left(\frac{1+i+j}{2}\right)}{\Gamma\!\left(\frac{2+i+j}{2}\right)}$$

and

$$H_{c\,i,j} = (-1)^{i+j}\,\frac{\Gamma\!\left(\frac{1+i+j}{2}\right)}{\Gamma\!\left(\frac{2+i+j}{2}\right)}, \tag{2.83}$$

Table 2.3 Table of maximum F ratio and corresponding eigenvector for differential arrays from
first to fourth-order for spherically isotropic noise fields.

order   max eigenvalue         eigenvector
1       $7 + 4\sqrt{3}$        $\frac{1}{2}\,[\sqrt{3}-1 \;\;\; 3-\sqrt{3}]$
2       $127 + 48\sqrt{7}$     $\left[\frac{1}{2(3+\sqrt{7})} \;\;\; \frac{\sqrt{7}}{3+\sqrt{7}} \;\;\; \frac{5}{2(3+\sqrt{7})}\right]$
3                              ≈ [0.0184 0.2004 0.4750 0.3061]
4       ≈ 151695               [0.0036 0.0670 0.2870 0.4318 0.2107]

Table 2.4 Table of maximum eigenvalue corresponding to the maximum front-to-back ratio
and corresponding eigenvector for differential arrays from first to fourth-order, for cylindrically
isotropic noise fields.

order   max eigenvalue                                          eigenvector
1       $\frac{\pi^2 + 4\sqrt{2}\,\pi + 8}{\pi^2 - 8}$          $[\sqrt{2}-1 \;\;\; 2-\sqrt{2}]$
2       $\frac{9\pi^2 + 12\sqrt{22}\,\pi + 88}{9\pi^2 - 88}$    ≈ [0.103 0.484 0.413]
3       ≈ 11556                                                 ≈ [0.022 0.217 0.475 0.286]
4       ≈ 336035                                                [0.00430 0.07429 0.29914 0.42521 0.19705]




where $\Gamma$ is the Gamma function [1]. The matrices $\mathbf{H}$,
$\mathbf{H}_c$, $\mathbf{B}$, and $\mathbf{B}_c$ are real Hankel matrices and
are positive definite. The resulting eigenvalues are therefore positive real
numbers and the eigenvectors are real. Tables 2.3 and 2.4 summarize the results
for the maximum front-to-back ratios for differential arrays up to fourth order.
The maximum eigenvalues for the third- and fourth-order cases result in very
complex algebraic expressions, so only numeric results are given. Plots showing
the highest front-to-back power ratio directivity patterns given by the optimum
weights in Tables 2.3-2.4 are displayed in Figs. 2.9-2.10. One observation that
can be made by comparing the cylindrical and spherical noise results is that the
directivity patterns are fairly similar. The differences are due to the lack of
the sine term in the denominator of (2.36). The optimal patterns for
cylindrically isotropic fields have smaller sidelobes in the rear half of the
microphone since this area is not weighted down by the sine term. Table 2.5
summarizes the optimal microphone designs presented in this section. The table
also includes columns for the 3 dB beamwidth and the position of the pattern
nulls. Knowledge of the null positions for different array designs allows one
to easily realize any higher-order array by combining first-order sections in a
tree architecture as shown in Fig. 2.5.

Figure 2.9 Directivity patterns for maximum front-to-back power ratio for differential arrays
in a spherically isotropic noise field for (a) first, (b) second, (c) third, and (d) fourth-order.
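
The eigenvalue formulation of (2.80)-(2.83) can be checked numerically. The
following is a minimal sketch, assuming NumPy is available; the function names
`front_back_matrices` and `max_front_to_back` are illustrative and do not come
from the text. It builds $\mathbf{B}$ and $\mathbf{H}$ (or their cylindrical
counterparts, with the common constant factor dropped), solves the generalized
eigenproblem, and scales the eigenvector so that the response is unity at
$\theta = 0^\circ$.

```python
import math
import numpy as np

def front_back_matrices(order, field="spherical"):
    """Return (B, H) for E(theta) = sum_i a_i cos^i(theta), i = 0..order."""
    n = order + 1
    B = np.empty((n, n))
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            k = i + j
            if field == "spherical":
                w = 1.0 / (1 + k)                                      # (2.82)
            else:
                w = math.gamma((1 + k) / 2) / math.gamma((2 + k) / 2)  # (2.83)
            B[i, j] = w
            H[i, j] = (-1) ** k * w                                    # (2.81)
    return B, H

def max_front_to_back(order, field="spherical"):
    """Largest front-to-back power ratio and the weights that achieve it."""
    B, H = front_back_matrices(order, field)
    # Generalized problem B a = F H a, solved as an ordinary eigenproblem
    # of inv(H) B (an implementation choice of this sketch).
    vals, vecs = np.linalg.eig(np.linalg.solve(H, B))
    vals, vecs = vals.real, vecs.real
    k = int(np.argmax(vals))
    a = vecs[:, k]
    return vals[k], a / a.sum()   # scale so that E(0) = sum(a) = 1

if __name__ == "__main__":
    for order in (1, 2, 3):
        F, a = max_front_to_back(order)
        print(order, round(10 * math.log10(F), 1), np.round(a, 4))
```

For small orders this direct approach is adequate; the Hilbert-type matrices
become poorly conditioned as the order grows, so a symmetric generalized
eigensolver or higher-precision arithmetic may be preferable there.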

The results summarized in Table 2.5 also show that there is a relatively small
difference between the optimal designs of differential arrays for the spherical
and cylindrical isotropic noise fields. Typical differences between directional
gains for either cylindrical or spherical isotropy assumptions are less than a
few tenths of a dB, most likely an insignificant amount. Probably the most
important detail to notice is that the rate of increase in the directional gain
versus differential array order is much smaller for cylindrically isotropic fields.
This conclusion was also shown earlier (see Fig. 2.6).

Table 2.5 Table of maximum directional gain and front-to-back power ratio for differential
arrays from first to fourth-order, for cylindrically and spherically isotropic noise fields.

Mic.     DI_c    F_c     DI      F       Beamwidth   Null(s)
order    (dB)    (dB)    (dB)    (dB)    (degs)      (degs)

Maximum gain for cylindrical noise
1st      4.8     10.9    5.9     11.1    112         120
2nd      7.0     10.9    9.4     7.5     65          72, 144
3rd      8.5     13.9    11.8    10.3    46          51, 103, 154
4th      9.5     13.9    13.7    8.9     36          40, 80, 120, 160

Maximum gain for spherical noise
1st      4.6     7.4     6.0     8.5     105         109
2nd      6.9     9.7     9.5     8.5     65          73, 134
3rd      8.3     12.4    12.0    11.2                55, 100, 145
4th      9.4     13.8    14.0    11.2                44, 80, 117, 152

Maximum front-to-back ratio for cylindrical noise
1st      4.6     12.8    5.4     10.9    120         135
2nd                              23.4    81          106, 153
3rd              40.6    9.8     37.0    66          98, 125, 161
4th      7.8     55.3

Maximum front-to-back ratio for spherical noise
1st                      5.7     11.4    115         125
2nd              25.1    8.3     24.0                104, 144
3rd              39.2    9.9     37.7    65          97, 122, 153
4th      7.8     53.6            51.8    57

Figure 2.10 Directivity patterns for maximum front-to-back power ratio for differential arrays
in a cylindrically isotropic noise field for (a) first, (b) second, (c) third, and (d) fourth-order.

4.4 MINIMUM PEAK DIRECTIONAL RESPONSE 

Another approach that might be of interest is to design differential arrays that
have an absolute maximum sidelobe response. This specification would allow
a designer to guarantee that the differential array response would not exceed a
defined level over a given angular range, where suppression of acoustic signals
is desired.

The first suggestion of an equi-sidelobe differential array design was in a
“comment” publication by V. I. Korenbaum [9], who only discussed a restricted
class of $n$th-order microphones that have the following form:

$$E_{N_K}(\theta) = [\alpha_1 + (1 - \alpha_1)\cos\theta]\,\cos^{n-1}\theta. \tag{2.84}$$

The restricted class defined by (2.84) essentially assumes that an $n$th-order
differential microphone is the combination of an $(n-1)$th-order dipole pattern
and a general first-order pattern. The major reason for considering this restricted
class is obvious: the algebra becomes very simple. Since we are dealing with
systems of order less than or equal to three, we do not need to restrict ourselves
to the class defined by (2.84).

A more general design of equi-sidelobe differential arrays can be obtained by
using the standard Dolph-Chebyshev design techniques [6]. With this method,
one can easily realize a differential microphone of any order. Roots of the
Dolph-Chebyshev system are easily obtained. Knowledge of the roots simplifies
the formulation of the canonic equations that describe $n$th-order microphones
as products of first-order differential elements.

To begin an analysis, the Chebyshev polynomials are defined as

$$T_n(x) = \begin{cases} \cos(n\cos^{-1}x), & -1 \le x \le 1 \\ \cosh(n\cosh^{-1}x), & 1 < |x|. \end{cases} \tag{2.85}$$

Chebyshev polynomials of order $n$ have $n$ real roots for arguments between -1
and 1, and their value grows proportional to $x^n$ for arguments with a magnitude
greater than 1. Thus, designs of $n$th-order Chebyshev arrays require a
transformation of the variable $x$ in (2.85). Using the substitution
$x = b + a\cos\theta$ in (2.85) enables one to form a desired $n$th-order
directional response that follows the Chebyshev polynomial over any range. At
$\theta = 0^\circ$, $x = x_0 = a + b$, and the value of the Chebyshev polynomial
is $T_n(x_0) > 1$. Setting this value to the desired mainlobe-to-sidelobe ratio
$L$, we have

$$L = T_n(x_0) = \cosh(n\cosh^{-1}x_0), \tag{2.86}$$



or equivalently,

$$x_0 = a + b = \cosh\!\left(\frac{1}{n}\cosh^{-1}L\right). \tag{2.87}$$

The sidelobe at $\theta = 180^\circ$ corresponds to $x = b - a = -1$. Therefore

$$a = \frac{x_0 + 1}{2}, \qquad b = \frac{x_0 - 1}{2}. \tag{2.88}$$



Since the zeros of the Chebyshev polynomial are readily calculable, the null
locations are easily found. From the definition of the Chebyshev polynomial
given in (2.85), the zeros occur at

$$x_m = \cos\!\left[\frac{(2m-1)\pi}{2n}\right], \qquad m = 1, \ldots, n. \tag{2.89}$$

The nulls are therefore at angles

$$\theta_m = \cos^{-1}\!\left(\frac{2\cos\!\left[\frac{(2m-1)\pi}{2n}\right] - x_0 + 1}{x_0 + 1}\right). \tag{2.90}$$

4.5 BEAMWIDTH

Another useful measure of the array performance is the beamwidth. The
beamwidth can be defined in many ways. It can refer to the angle enclosed
between the zeros of a directional response or between the 3 dB points. We will
use the 3 dB beamwidth definition in this chapter. For a first-order microphone
response (2.17), the 3 dB beamwidth is simply

$$\theta_{B1} = 2\cos^{-1}\!\left[\frac{-2a_0 + \sqrt{2}\,(a_0 + a_1)}{2a_1}\right]. \tag{2.91}$$



For a second-order array, the algebra is somewhat more difficult but still
straightforward. The result from (2.28) is

$$\theta_{B2} = 2\cos^{-1}\!\left[\frac{-a_1 \pm \sqrt{a_1^2 + 2\sqrt{2}\,\left[a_2^2 + a_1 a_2 + (1 - \sqrt{2})\,a_0 a_2\right]}}{2a_2}\right], \tag{2.92}$$

where it is assumed that $a_2 \neq 0$. If $a_2 = 0$, the microphone degenerates
into a first-order array and the beamwidth can be calculated by (2.91).

Similarly, the beamwidth for a third-order array can be found although the
algebraic form is extremely lengthy and is therefore not included here.
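
The closed forms (2.91) and (2.92) are easy to evaluate directly. The sketch
below uses only the Python standard library; the function names are
illustrative, the positive root is taken in (2.92), and a `ValueError` is
raised when no 3 dB point exists (an assumption of this sketch about how to
treat nearly omnidirectional designs).

```python
import math

def beamwidth_first_order(a0, a1):
    """3 dB beamwidth in degrees of E = a0 + a1 cos(theta), from (2.91)."""
    c = (-2.0 * a0 + math.sqrt(2.0) * (a0 + a1)) / (2.0 * a1)
    if abs(c) > 1.0:
        raise ValueError("response never falls 3 dB below its on-axis value")
    return 2.0 * math.degrees(math.acos(c))

def beamwidth_second_order(a0, a1, a2):
    """3 dB beamwidth in degrees of E = a0 + a1 cos + a2 cos^2, from (2.92)."""
    r2 = math.sqrt(2.0)
    disc = a1 ** 2 + 2.0 * r2 * (a2 ** 2 + a1 * a2 + (1.0 - r2) * a0 * a2)
    c = (-a1 + math.sqrt(disc)) / (2.0 * a2)   # positive root assumed
    if abs(c) > 1.0:
        raise ValueError("response never falls 3 dB below its on-axis value")
    return 2.0 * math.degrees(math.acos(c))

if __name__ == "__main__":
    print(beamwidth_first_order(0.0, 1.0))             # dipole
    print(beamwidth_first_order(0.5, 0.5))             # cardioid
    print(beamwidth_second_order(-1 / 6, 1 / 3, 5 / 6))  # second-order hypercardioid
```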



5. DESIGN EXAMPLES 

For differential microphones with interelement spacing much less than the
acoustic wavelength, the maximum directivity index is attained when all of
the microphones are collinear. For this case, the maximum directivity index is
$20\log_{10}(n + 1)$, where $n$ is the order of the microphone [13]. For first-,
second-, and third-order microphones, the maximum directivity indices are 6.0,
9.5, and 12.0 dB, respectively. Derivations of some design examples for
$n \le 3$ are given in the following sections.

As indicated in (2.25), there are an infinite number of possible designs for
$n$th-order differential arrays. Presently, the most common first-order
microphones are: dipole, cardioid, hypercardioid, and supercardioid. The extension
to higher orders is straightforward and is developed in later sections. Most of
the arrays that are described in this chapter have directional characteristics that
are optimal in some way; namely, the arrays are optimal with respect to one
of the performance measures previously discussed: directivity index,
front-to-back ratio, sidelobe threshold, and beamwidth. A summary of the results
for first-, second-, and third-order microphones is given in Table 2.6 at the
end of this section.

Figure 2.11 Directivity index of first-order microphone versus the first-order differential
parameter $\alpha_1$.



5.1 FIRST-ORDER DESIGNS 

Before actual first-order differential designs are discussed it is instructive
to first examine the effects of the parameter $\alpha_1$ on the directivity index
DI, the front-to-back ratio $F$, and the beamwidth of the microphone. For
first-order arrays $\alpha_1 = a_0$ and $a_1 = 1 - \alpha_1$. Figure 2.11 shows
the directivity index of a first-order system for values of $\alpha_1$ between 0
and 1. The first-order differential microphone that corresponds to the maximum in
Fig. 2.11 is given the name hypercardioid. When $\alpha_1 = 0$, the first-order
differential system is a dipole. At $\alpha_1 = 1$, the microphone is an
omnidirectional microphone with 0 dB directivity index. Figure 2.12 shows the
dependence of the front-to-back ratio $F$ on $\alpha_1$. The maximum $F$ value
corresponds to the supercardioid design. Figure 2.13 shows the 3 dB beamwidth of
the first-order differential microphone as a function of $\alpha_1$.

When $\alpha_1 \approx 0.7$, the 3 dB down point is approximately at 180°. Higher
values of $\alpha_1$ correspond to designs that are increasingly omnidirectional
and are sometimes referred to as subcardioid in the literature. Figure 2.13
indicates that the first-order differential microphone with the smallest
beamwidth is the dipole microphone with a 3 dB beamwidth of 90°.

Figure 2.12 Front-to-back ratio of first-order microphone versus the first-order differential
parameter $\alpha_1$.
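
The dependence of DI and $F$ on $\alpha_1$ can be reproduced with a few lines
of code. The sketch below assumes NumPy and the spherically isotropic
definitions used above, with $E(\theta) = \alpha_1 + (1-\alpha_1)\cos\theta$;
the function name `first_order_measures` and the grid resolution are
illustrative choices. It uses the fact that the front and rear half-space
integrals differ only in the sign of the odd term $a_0 a_1$.

```python
import numpy as np

def first_order_measures(alpha):
    """DI and F in dB for E(theta) = alpha + (1 - alpha) cos(theta)."""
    a0 = np.asarray(alpha, dtype=float)
    a1 = 1.0 - a0
    even = a0 ** 2 + a1 ** 2 / 3.0   # (1/2) * integral of E^2 sin(theta), 0..pi
    odd = a0 * a1                    # changes sign between front and rear halves
    di = 10.0 * np.log10(1.0 / even)               # E(0) = 1
    fb = 10.0 * np.log10((even + odd) / (even - odd))
    return di, fb

# Coarse grid search over the first-order parameter.
alphas = np.linspace(0.0, 1.0, 10001)
di, fb = first_order_measures(alphas)
alpha_max_di = alphas[np.argmax(di)]   # hypercardioid
alpha_max_fb = alphas[np.argmax(fb)]   # supercardioid

if __name__ == "__main__":
    print(alpha_max_di, di.max())
    print(alpha_max_fb, fb.max())
```

The grid maximum of DI falls at $\alpha_1 = 1/4$ and that of $F$ near
$\alpha_1 = (\sqrt{3}-1)/2 \approx 0.37$, in agreement with the hypercardioid
and supercardioid designs discussed in the following subsections.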

5.1.1 Dipole. From Euler’s equation, it is evident that the dipole microphone
is simply related to an acoustic particle-velocity microphone. The construction
was described earlier; the dipole is normally realized as a diaphragm
whose front and rear sides are directly exposed to the sound field. In (2.17), the
dipole microphone corresponds to the simple case where $a_0 = 0$, $a_1 = 1$,

$$E_{D_1}(\theta) = \cos\theta. \tag{2.93}$$

In Fig. 2.14(a), a polar plot of the magnitude of (2.93) shows the classic cosine
pattern for the microphone.

A first-order dipole microphone has a directivity index of 4.8 dB and a 3 dB
beamwidth of 90°. A single null in the response is at $\theta = 90^\circ$. One
potential problem, however, is that it is bidirectional; in other words, the
pattern is symmetric about the axis tangential to the diaphragm or normal to the
axis of two subtracted zero-order microphones that form a dipole. From pattern
symmetry it is evident that the front-to-back ratio is 0 dB.

Figure 2.13 3 dB beamwidth of first-order microphone versus the first-order differential
parameter $\alpha_1$.



5.1.2 Cardioid. As shown earlier, all first-order patterns correspond to
the “limaçon of Pascal” algebraic form. The special case of $\alpha_1 = 1/2$ is
the cardioid pattern. The pattern is described by

$$E_{C_1}(\theta) = \frac{1 + \cos\theta}{2}, \tag{2.94}$$

which is plotted in Fig. 2.14(b). Although the cardioid microphone is not
optimal in directional gain or front-to-back ratio, it is the most commonly
manufactured differential microphone. The cardioid directivity index is 4.8 dB,
the same as that of the dipole microphone, and the 3 dB beamwidth is 131°. A
null in the response is located at $\theta = 180^\circ$ and the front-to-back
ratio is 8.5 dB.



5.1.3 Hypercardioid. The hypercardioid microphone has the distinction
of having the highest directivity index of any first-order microphone. From the
previous section that discussed optimal arrays in a spherically isotropic field,
the first-order hypercardioid response can be written as

$$E_{HC_1}(\theta) = \frac{1 + 3\cos\theta}{4}. \tag{2.95}$$

Figure 2.14(c) is a polar plot of the absolute value of (2.95). The 3 dB beamwidth
is equal to 105° and the null is at 109°. The directivity index is 6 dB, or
$10\log_{10}(4)$, the maximum directivity index for a first-order system, and the
front-to-back ratio is 8.5 dB.

Figure 2.14 Various first-order directional responses, (a) dipole, (b) cardioid, (c) hypercardioid,
(d) supercardioid.

5.1.4 Supercardioid. The name supercardioid is commonly used for
a first-order differential design which maximizes the front-to-back received
power. Apparently, the first reference to the supercardioid design appears in a
1941 paper by Marshall and Harry [11]. A supercardioid is of interest since
of all first-order designs it has the highest front-to-back power rejection for
isotropic noise. A supercardioid response can be written as

$$E_{SC_1}(\theta) = \frac{\sqrt{3} - 1 + (3 - \sqrt{3})\cos\theta}{2}. \tag{2.96}$$

Figure 2.14(d) is a plot of the magnitude of (2.96). The directivity index for the
supercardioid is 5.7 dB with a 3 dB beamwidth of 115°. A null in the response
is located at 125°. The front-to-back ratio is 11.4 dB.



5.2 SECOND-ORDER DESIGNS 

As with first-order systems, there are an unlimited number of second-order
array designs. Since second-order microphones are not readily available on
the market today, there are no “common” configurations. Two designs that
have been suggested are the second-order cardioid and the second-order
hypercardioid [12, 15]. Another group of proposed differential microphones is a
restricted class of equi-sidelobe designs for arbitrary order $n$ [9]. This section
presents some of these designs as well as non-restricted equi-sidelobe designs
and also a variety of second-order differential array designs based on common
first-order microphones. The general second-order form as given in (2.28) has
three parameters, $a_0$, $a_1$, and $a_2$. Equivalently, the second-order
differential is the product of two first-order differential forms as shown in
(2.29).

It is informative to plot the directivity index and front-to-back ratio as a
function of the canonical values of $\alpha_1$ and $\alpha_2$. One can then
easily visualize how these measures change for different second-order designs.
Figures 2.15 and 2.16 depict the dependence of DI and $F$ on the two independent
parameters $\alpha_1$ and $\alpha_2$ from (2.29). Both figures are plotted for
values of $\alpha_1$ and $\alpha_2$ between -1 and +1. Figure 2.15 shows the DI
maximum value of 9.5 dB and the interval between the contours is 0.5 dB. The two
peaks in the plot represent the same maximum; only the order of the product of
first-order sections used to represent the second-order response has changed.
Figure 2.16 shows the result for the front-to-back ratio, which has a maximum
value of 24.0 dB; the contours are in 1 dB steps.

5.2.1 Second-Order Dipole. By pattern multiplication, the second-order
dipole directional response is the product of two first-order dipoles which have
$\cos\theta$ response patterns, given by

$$E_{D_2}(\theta) = \cos^2\theta. \tag{2.97}$$

Figure 2.17(a) shows the polar magnitude response for this array. The directivity
index is 7.0 dB, and by symmetry the front-to-back ratio is 0 dB. The 3 dB
beamwidth is 65°.

Figure 2.15 Contour plot of the directivity index DI in dB for second-order array versus $\alpha_1$
and $\alpha_2$. The contours are in 0.5 dB intervals.


5.2.2 Second-Order Cardioid. In general the term second-order cardioid
implies that either first-order term in the second-order expression given
in (2.29) can be a cardioid. A simple second-order cardioid corresponds to the
case when both first-order terms are of the cardioid form

$$E_{C_2}(\theta) = \frac{(1 + \cos\theta)^2}{4}. \tag{2.98}$$

For this special case $\alpha_1 = \alpha_2 = 0.5$, the directivity index is 7.0 dB,
and both nulls fall at $\theta = 180^\circ$. The front-to-back ratio is 14.9 dB.

A more general form for a second-order array can be written as the product
of a first-order array with that of a first-order cardioid. An equation for this
second-order cardioid is

$$E_{C_2}(\theta) = \frac{[\alpha_1 + (1 - \alpha_1)\cos\theta]\,[1 + \cos\theta]}{2}. \tag{2.99}$$



5.2.3 Second-Order Hypercardioid. The second-order hypercardioid
has the highest directivity index of a second-order system; its directivity
index is 9.5 dB. A derivation of the directivity pattern and the parameters that
determine the second-order hypercardioid are contained in Section 3. The
results are:

$$\alpha_1 = \pm\frac{1}{\sqrt{6}} \approx \pm 0.41, \tag{2.100}$$

$$\alpha_2 = \mp\frac{1}{\sqrt{6}} \approx \mp 0.41. \tag{2.101}$$

Figure 2.16 Contour plot of the front-to-back ratio in dB for second-order arrays versus $\alpha_1$
and $\alpha_2$. The contours are in 1 dB increments.

These values correspond to the peaks in Fig. 2.15. Null locations for the
second-order hypercardioid are at 73° and 134°. The front-to-back ratio is 8.5 dB,
which is the same as for the first-order differential cardioid and first-order
differential hypercardioid. A polar response is shown in Fig. 2.17(c).
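
The relation between the canonic parameters $\alpha_k$, the polynomial
coefficients $a_i$, and the performance measures can be illustrated with a
short script. This is a sketch assuming NumPy and the spherically isotropic
definitions used earlier; the helper names `canonic_to_coeffs`, `measures`,
and `null_angles` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def canonic_to_coeffs(alphas):
    """Expand prod_k [alpha_k + (1 - alpha_k) x], x = cos(theta); returns a_0..a_n."""
    a = np.array([1.0])
    for al in alphas:
        a = P.polymul(a, [al, 1.0 - al])   # coefficients in ascending powers of x
    return a

def measures(a):
    """Directivity index and front-to-back ratio in dB (spherical isotropy)."""
    a = np.asarray(a, dtype=float)
    i = np.arange(len(a))
    k = i[:, None] + i[None, :]
    terms = np.outer(a, a) / (1.0 + k)     # a_i a_j / (1 + i + j)
    even = terms[k % 2 == 0].sum()
    odd = terms[k % 2 == 1].sum()
    di = 10.0 * np.log10(a.sum() ** 2 / even)
    fb = 10.0 * np.log10((even + odd) / (even - odd))
    return di, fb

def null_angles(alphas):
    """Null angles in degrees; assumes each first-order factor has a real null."""
    c = np.array([-al / (1.0 - al) for al in alphas])
    return np.sort(np.degrees(np.arccos(c)))

if __name__ == "__main__":
    alphas = [1 / np.sqrt(6), -1 / np.sqrt(6)]   # second-order hypercardioid
    a = canonic_to_coeffs(alphas)
    print(a, measures(a), null_angles(alphas))
```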

5.2.4 Second-Order Supercardioid. The term second-order supercardioid
designates an optimal design for the second-order differential microphone
with respect to the front-to-back received power ratio. A derivation for the
supercardioid microphone was also given in Section 3 and the results (repeated
here) are:

$$\alpha_1 = \frac{\sqrt{7} - 2 \pm \sqrt{8 - 3\sqrt{7}}}{2} \approx 0.45,\ 0.20, \tag{2.102}$$

$$\alpha_2 = \frac{\sqrt{7} - 2 \mp \sqrt{8 - 3\sqrt{7}}}{2} \approx 0.20,\ 0.45. \tag{2.103}$$

Figure 2.17 Various second-order directional responses, (a) dipole, (b) cardioid, (c) hypercardioid,
(d) supercardioid.

These values correspond to the peaks in Figure 2.16. Figure 2.17(d) is a plot of
the magnitude of the directional response. The directivity index is equal to 8.3
dB, the nulls are located at 104° and 144°, and the front-to-back ratio is 24.0 dB.



5.2.5 Equi-Sidelobe Second-Order Differential. Since a second-order
differential microphone has two zeros in its response it is possible to design a
second-order microphone such that the two lobes defined by these zeros are at
the same level. Figure 2.18(a) shows the only second-order equi-sidelobe design
possible using the form of (2.84). The directivity index of this Korenbaum
second-order differential array is 8.9 dB. The beamwidth is 76° and the
front-to-back ratio is 17.6 dB.

Figure 2.18 Various second-order equi-sidelobe designs, (a) Korenbaum design, (b) -15 dB
sidelobes, (c) -30 dB sidelobes, (d) minimum rear half-plane peak response.

We begin our analysis of second-order Chebyshev differential arrays by
comparing terms of the Chebyshev polynomial and the second-order array
response function. The Chebyshev polynomial of order 2 is

$$T_2(x) = 2x^2 - 1 = 2b^2 - 1 + 4ab\cos\theta + 2a^2\cos^2\theta. \tag{2.104}$$

Comparing like terms of (2.104) and (2.28) yields:

$$a_0 = \frac{2b^2 - 1}{L}, \qquad a_1 = \frac{4ab}{L}, \qquad a_2 = \frac{2a^2}{L}, \tag{2.105}$$

where $L$ is again the sidelobe threshold level.

By substituting the results of (2.88) into (2.105), we can determine the
necessary coefficients for the desired equi-sidelobe second-order differential
microphone:

$$a_0 = \frac{x_0^2 - 2x_0 - 1}{2L}, \qquad a_1 = \frac{x_0^2 - 1}{L}, \qquad a_2 = \frac{(x_0 + 1)^2}{2L}, \tag{2.106}$$

where

$$x_0 = \cosh\!\left(\frac{1}{2}\cosh^{-1}L\right). \tag{2.107}$$

Thus for the second-order differential microphone,

$$\theta_{1,2} = \cos^{-1}\!\left(\frac{1 - x_0 \pm \sqrt{2}}{1 + x_0}\right). \tag{2.108}$$



The null locations given in (2.108) can be used along with (2.32) and (2.33)
to determine the canonic first-order differential parameters $\alpha_1$ and
$\alpha_2$. Figures 2.18(b) and 2.18(c) show the resulting second-order designs
for -15 dB and -30 dB sidelobes, respectively. The directivity indices for the
two designs are respectively 9.4 dB and 8.1 dB. Null locations for the -15 dB
sidelobes design are at 78° and 142°. By allowing a higher sidelobe level than
the Korenbaum design (for $x_0 = 1 + \sqrt{2}$, with $\theta_1 = 90^\circ$), a
higher directivity index can be achieved. In fact, the directivity index
monotonically increases until the sidelobe levels exceed -13 dB; at this point
the directivity index reaches its maximum at 9.5 dB, almost the maximum
directivity index for a second-order differential microphone. For sidelobe levels
less than -20.6 dB (Korenbaum design), both nulls are in the rear half-plane of
the second-order microphone; null locations for the -30 dB sidelobes design are
at 109° and 152°. Equi-sidelobe second-order directional patterns always contain
a peak lobe at $\theta = 180^\circ$.

One interesting design that arises from the preceding development is a
second-order differential microphone that minimizes the peak rear half-space
response. This design corresponds to the case where the front-lobe response
level at $\theta = 90^\circ$ is equal to the equi-sidelobe level (for
$x_0 = 3$). Figure 2.18(d) is a directional plot of this realization. The canonic
first-order differential parameters for this equi-sidelobe design are:

$$\alpha_1 = \frac{5 \pm 2\sqrt{2}}{17}, \qquad \alpha_2 = \frac{5 \mp 2\sqrt{2}}{17}. \tag{2.109}$$

This design has a directivity index of 8.5 dB and nulls located at 149° and 98°.
The front-to-back ratio is 22.4 dB.

Two other design possibilities can be obtained by determining the
equi-sidelobe second-order design that maximizes either the directivity index DI
or the front-to-back ratio $F$. Figure 2.19 is a plot of the directivity and
front-to-back indices as a function of sidelobe level. As mentioned earlier, a
-13 dB sidelobe level maximizes the directivity index at 9.5 dB. A sidelobe level
of -27.5 dB maximizes the front-to-back ratio. Plots of these two designs are
shown in Fig. 2.20. Of course, an arbitrary combination of DI and $F$ could
also be maximized for some given optimality criterion if desired.
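
The second-order equi-sidelobe design equations (2.106)-(2.108) translate
directly into code. The following sketch uses only the Python standard library;
the function name `equi_sidelobe_second_order` and the convention of passing the
sidelobe level in dB are assumptions made for illustration.

```python
import math

def equi_sidelobe_second_order(sidelobe_db):
    """Coefficients (a0, a1, a2) and null angles for a given sidelobe level.

    sidelobe_db is the sidelobe level relative to the mainlobe, e.g. -15.
    """
    L = 10.0 ** (abs(sidelobe_db) / 20.0)            # mainlobe-to-sidelobe ratio
    x0 = math.cosh(0.5 * math.acosh(L))              # (2.107)
    a0 = (x0 ** 2 - 2.0 * x0 - 1.0) / (2.0 * L)      # (2.106)
    a1 = (x0 ** 2 - 1.0) / L
    a2 = (x0 + 1.0) ** 2 / (2.0 * L)
    nulls = [
        math.degrees(math.acos((1.0 - x0 + s * math.sqrt(2.0)) / (1.0 + x0)))
        for s in (+1.0, -1.0)                        # (2.108)
    ]
    return (a0, a1, a2), nulls

if __name__ == "__main__":
    for level in (-15.0, -30.0):
        coeffs, nulls = equi_sidelobe_second_order(level)
        print(level, [round(c, 4) for c in coeffs], [round(t, 1) for t in nulls])
```

Two quick consistency checks follow from the construction: the coefficients sum
to one (unit response at $\theta = 0^\circ$), and the response at
$\theta = 180^\circ$, $a_0 - a_1 + a_2$, equals $1/L$.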



5.2.6 Maximum Second-Order Differential DI and F Using Common
First-Order Differential Microphones. Another approach to the design of
second-order differential microphones involves the combination of the outputs
of two first-order differential microphones. Specifically, the combination is
a subtraction of the first-order differential outputs after one is passed through
a delay element. If the first-order differential microphone can be designed to
have any desired canonic parameter $\alpha_1$, then any second-order differential
array can be designed. More commonly a designer will have to work with
off-the-shelf first-order differential microphones, such as standard first-order
designs. If the second-order design is constrained to standard first-order
differential microphones, then it is not possible to reach the maximum
directivity index and front-to-back ratio. The following section discusses how
to implement an optimal design with respect to directivity index and
front-to-back power rejection when using standard first-order microphone
elements.

Given a first-order differential microphone with the directivity function,

$$E(\alpha_2, \theta) = \alpha_2 + (1 - \alpha_2)\cos\theta, \tag{2.110}$$

where $\alpha_2$ is a constant, it is of interest to know how to combine two of
these microphones so that the directivity index is maximized. A maximum can be
found by multiplying (2.110) by a general first-order response, integrating the
square of this product from $\theta = 0$ to $\pi$, taking the derivative with
respect to $\alpha_1$, and setting the resulting derivative to zero. The result is

$$\alpha_1 = \frac{3}{8} - \frac{5\alpha_2}{8\,(2 - 9\alpha_2 + 12\alpha_2^2)}. \tag{2.111}$$

Figure 2.19 Directivity index (solid) and front-to-back ratio (dotted) for equi-sidelobe second-
order array designs versus sidelobe level.

A plot of the directivity index for $0 < \alpha_2 < 1$ is shown in Fig. 2.21. A
maximum value of 9.5 dB occurs when $-\alpha_1 = \alpha_2 \approx 0.41$.
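
Equation (2.111) can be checked against a direct evaluation of the directivity
index. The sketch below uses only the Python standard library and assumes the
canonic second-order form of (2.29) normalized to unit response at
$\theta = 0^\circ$; the function names are illustrative.

```python
import math

def best_alpha1_for_di(alpha2):
    """alpha_1 that maximizes DI for a fixed alpha_2, from (2.111)."""
    return 3.0 / 8.0 - 5.0 * alpha2 / (8.0 * (2.0 - 9.0 * alpha2 + 12.0 * alpha2 ** 2))

def di_db(alpha1, alpha2):
    """Directivity index in dB of [a1 + (1 - a1) cos][a2 + (1 - a2) cos]."""
    a0 = alpha1 * alpha2
    a1 = alpha1 * (1.0 - alpha2) + alpha2 * (1.0 - alpha1)
    a2 = (1.0 - alpha1) * (1.0 - alpha2)
    # (1/2) * integral over 0..pi of E^2 sin(theta); E(0) = 1 by construction
    mean_sq = a0 ** 2 + (a1 ** 2 + 2.0 * a0 * a2) / 3.0 + a2 ** 2 / 5.0
    return -10.0 * math.log10(mean_sq)

if __name__ == "__main__":
    for alpha2 in (0.25, 1.0 / math.sqrt(6.0), 0.5):
        alpha1 = best_alpha1_for_di(alpha2)
        print(round(alpha2, 3), round(alpha1, 3), round(di_db(alpha1, alpha2), 2))
```

With $\alpha_2 = 1/\sqrt{6}$ the formula returns $\alpha_1 = -1/\sqrt{6}$, the
second-order hypercardioid, and the directivity index evaluates to
$10\log_{10} 9 \approx 9.5$ dB.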

A similar calculation for the maximum front-to-back power response yields
a rather long expression:

$$\alpha_1 = \frac{\sqrt{\gamma} - 8\alpha_2^3 - 12\alpha_2^2 + 13\alpha_2 - 3}{24\alpha_2^4 - 6\alpha_2 + 2}, \tag{2.112}$$

where

$$\gamma = 96\alpha_2^8 + 144\alpha_2^7 + 280\alpha_2^6 - 120\alpha_2^5 - 20\alpha_2^4 - 306\alpha_2^3 + 331\alpha_2^2 - 120\alpha_2 + 15. \tag{2.113}$$

Figure 2.20 Directional responses for equi-sidelobe second-order differential arrays for, (a)
maximum directivity index, and, (b) maximum front-to-back ratio.

A plot of the front-to-back ratio for $0 < \alpha_2 < 1$ is shown in Fig. 2.22.

A maximum value of 24.0 dB occurs when $\alpha_2 \approx 0.45$ and
$\alpha_1 \approx 0.20$, which are the values of the second-order supercardioid.
By symmetry, the values of $\alpha_1$ and $\alpha_2$ can obviously be
interchanged. The double peak at $F = 24.0$ dB in Figure 2.22 is a direct result
of this symmetry.

5.3 THIRD-ORDER DESIGNS 

Very little can be found in the open literature on the construction and design 
of third-order differential microphones. The earliest paper in which an actual 
device was designed and constructed was by B. R. Beavers and R. Brown in 1970 
[2]. The lack of any papers on third-order arrays is not surprising given both 
the extreme precision that is necessary to realize these arrays and the serious 
signal-to-noise problems (these problems are discussed in more detail in a later 
section). However, recent advances in low noise microphones and electronics 
support the feasibility of third-order microphone construction. With this in 
mind, the following section describes several possible design implementations. 

5.3.1 Third-Order Dipole. By pattern multiplication, the third-order
dipole directional response is given by

$$E_{D_3}(\theta) = \cos^3\theta. \tag{2.114}$$

Figure 2.23(a) shows the magnitude response for this array. The directivity
index is 8.5 dB, while the front-to-back ratio is 0 dB and the 3 dB beamwidth
is 54°.

Figure 2.21 Maximum second-order differential directivity index DI for first-order differential
microphones defined by (2.110).



5.3.2 Third-Order Cardioid. The terminology of cardioids is ambiguous
for second-order arrays. For the third order, this ambiguity is even more
pervasive. Nevertheless, an obvious array possibility is to form a cardioid by
using the pattern multiplication of three first-order differential cardioids:

$$E_{C_3}(\theta) = \frac{(1 + \cos\theta)^3}{8}. \tag{2.115}$$

Figure 2.23(b) shows the directional response for this array. The three nulls
all fall at 180°. The directivity index is 8.5 dB and the front-to-back ratio is
21.0 dB.



5.3.3 Third-Order Hypercardioid. A derivation for the third-order
differential hypercardioid was given in Section 4.2. The results for the
coefficients in (2.34) are:

$$a_0 = -\frac{3}{32}, \qquad a_1 = -\frac{15}{32}, \qquad a_2 = \frac{15}{32}, \qquad a_3 = \frac{35}{32}. \tag{2.116}$$

After solving for the roots of (2.34) with the coefficients given in (2.116), the
coefficients of the canonic representation are:

$$\alpha_1 = \tfrac{1}{2}\left[\sqrt{5}\cos(\phi/3) - \tfrac{1}{2}\right] \approx 0.45,$$
$$\alpha_2 = \tfrac{1}{2}\left[\sqrt{5}\cos(\phi/3 + 4\pi/3) - \tfrac{1}{2}\right] \approx 0.15,$$
$$\alpha_3 = \tfrac{1}{2}\left[\sqrt{5}\cos(\phi/3 + 2\pi/3) - \tfrac{1}{2}\right] \approx -1.35, \tag{2.117}$$

where

$$\phi = \arccos\!\left(-2/\sqrt{5}\right). \tag{2.118}$$

Figure 2.22 Maximum second-order differential front-to-back ratio for first-order differential
microphones defined by (2.110).

Figure 2.23 Various third-order directional responses, (a) dipole, (b) cardioid, (c) hypercardioid,
(d) supercardioid.

Figure 2.23(c) shows the directional response of the third-order differential
hypercardioid. The directivity index of the third-order differential hypercardioid
is the maximum for all third-order designs: 12.0 dB. The front-to-back ratio is
11.2 dB and the three nulls are located at $\theta$ = 55°, 100°, and 145°.
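
The step from the polynomial coefficients of (2.116) to the canonic parameters
and null angles can also be carried out numerically. The sketch below assumes
NumPy and a polynomial response $\sum_i a_i \cos^i\theta$ whose roots in
$\cos\theta$ are all real; the function name `canonic_from_coeffs` is
illustrative.

```python
import numpy as np

def canonic_from_coeffs(a):
    """Canonic parameters alpha_k and null angles (degrees) from a_0..a_n.

    Each first-order factor alpha + (1 - alpha) x vanishes at
    x = -alpha / (1 - alpha), so alpha = x / (x - 1) for every root x.
    """
    a = np.asarray(a, dtype=float)
    x = np.real_if_close(np.roots(a[::-1]))   # np.roots expects highest power first
    alphas = x / (x - 1.0)
    nulls = np.degrees(np.arccos(x))          # valid when every root lies in [-1, 1]
    return np.sort(alphas), np.sort(nulls)

if __name__ == "__main__":
    a_hc3 = np.array([-3.0, -15.0, 15.0, 35.0]) / 32.0   # (2.116)
    alphas, nulls = canonic_from_coeffs(a_hc3)
    print(np.round(alphas, 3), np.round(nulls, 1))
```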

5.3.4 Third-Order Supercardioid. The third-order supercardioid is the
third-order differential array with the maximum front-to-back power ratio. The
derivation of this array was given in Section 4.3. The requisite coefficients
$a_i$ are:

$$a_0 = \frac{\sqrt{2}\,\sqrt{21 - \sqrt{21}} - 1 - \sqrt{21}}{8} \approx 0.018,$$
$$a_1 = \frac{21 + 9\sqrt{21} - \sqrt{2}\,(6 + \sqrt{21})\sqrt{21 - \sqrt{21}}}{8} \approx 0.200,$$
$$a_2 = \frac{3\left[\sqrt{2}\,(4 + \sqrt{21})\sqrt{21 - \sqrt{21}} - 25 - 5\sqrt{21}\right]}{8} \approx 0.475,$$
$$a_3 = \frac{63 + 7\sqrt{21} - \sqrt{2}\,(7 + 2\sqrt{21})\sqrt{21 - \sqrt{21}}}{8} \approx 0.306. \tag{2.119}$$

Figure 2.23(d) shows a directivity plot of the resulting supercardioid
microphone. The directivity index is 9.9 dB and the front-to-back ratio is 37.7 dB.
The nulls are located at 97°, 122°, and 153°. This third-order supercardioid
has almost no sensitivity to the rear half-plane. For situations where the user
desires information from only one half-plane, the third-order supercardioid
microphone performs optimally. Finding the roots of (2.34) with the coefficients
given by (2.119) yields the parameters of the canonic expression given in (2.35)
as

$$\alpha_1 \approx 0.113, \qquad \alpha_2 \approx 0.473, \qquad \alpha_3 \approx 0.346. \tag{2.120}$$



5.3.5 Equi-Sidelobe Third-Order Differential. Finally, the design of
equi-sidelobe third-order differential arrays is explored. Like the design of
equi-sidelobe second-order differential microphones, a third-order equi-sidelobe
array relies on the use of Chebyshev polynomials and the Dolph-Chebyshev
antenna synthesis technique. The basic technique was discussed earlier in Section
2.3. For a third-order microphone, the Chebyshev polynomial is

$$T_3(x) = 4x^3 - 3x. \tag{2.121}$$

Using the transformation $x = b + a\cos\theta$ and comparing terms with (2.34)
leads to

$$a_0 = \frac{b\,(4b^2 - 3)}{L}, \qquad a_1 = \frac{3a\,(4b^2 - 1)}{L}, \qquad a_2 = \frac{12a^2 b}{L}, \qquad a_3 = \frac{4a^3}{L}. \tag{2.122}$$

Figure 2.24 Equi-sidelobe third-order differential microphone for (a) -20 dB and (b) -30 dB
sidelobes.

Combining (2.86), (2.87), (2.88), and (2.122) yields the coefficients for the
equi-sidelobe third-order differential. These results are:

$$a_0 = \frac{(x_0 - 1)(x_0^2 - 2x_0 - 2)}{2L}, \qquad a_1 = \frac{3x_0\,(x_0 + 1)(x_0 - 2)}{2L},$$
$$a_2 = \frac{3\,(x_0 + 1)^2 (x_0 - 1)}{2L}, \qquad a_3 = \frac{(x_0 + 1)^3}{2L},$$

where $x_0 = \cosh\!\left(\frac{1}{3}\cosh^{-1}L\right)$.

Figures 2.24(a) and 2.24(b) show the resulting patterns for -20 dB and -30 dB 
sidelobe levels. From (2.90), the nulls for the -20 dB sidelobe levels are at 
62°, 102°, and 153°. The nulls for the -30 dB sidelobe case are at 79°, 111°, 
and 156°. The directivity indices are 11.8 dB and 10.8 dB, respectively. The 
front-to-back ratio for the -20 dB design is 14.8 dB and 25.2 dB for the -30 dB 
design. 
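The null positions quoted above can be reproduced with a short sketch. It assumes the usual Dolph-Chebyshev relations behind (2.86)-(2.88), which are not restated in this section: $x_0 = \cosh(\tfrac{1}{3}\cosh^{-1} S)$ with $S$ the linear mainlobe-to-sidelobe ratio, $a = (x_0 + 1)/2$, and $b = (x_0 - 1)/2$, so that $\theta = 0$ maps to $x_0$ and $\theta = 180^\circ$ maps to $x = -1$. The function name is hypothetical.

```python
import math

def equi_sidelobe_third_order(sidelobe_db):
    """Return (coeffs, nulls_deg) for a third-order equi-sidelobe design.

    sidelobe_db is the sidelobe level as a positive number of dB below the
    main lobe (for example 20 for a -20 dB design).
    """
    s = 10.0 ** (sidelobe_db / 20.0)      # linear mainlobe-to-sidelobe ratio
    x0 = math.cosh(math.acosh(s) / 3.0)   # chosen so that T3(x0) = s
    a = (x0 + 1.0) / 2.0
    b = (x0 - 1.0) / 2.0
    # Coefficients of T3(b + a*cos(theta)) / s in powers of cos(theta).
    coeffs = [
        b * (4.0 * b ** 2 - 3.0) / s,
        3.0 * a * (4.0 * b ** 2 - 1.0) / s,
        12.0 * a ** 2 * b / s,
        4.0 * a ** 3 / s,
    ]
    # T3 has zeros at x = cos(pi/6), 0, and -cos(pi/6).
    zeros_x = [math.cos(math.pi / 6.0), 0.0, -math.cos(math.pi / 6.0)]
    nulls_deg = [math.degrees(math.acos((x - b) / a)) for x in zeros_x]
    return coeffs, nulls_deg

for level in (20.0, 30.0):
    coeffs, nulls = equi_sidelobe_third_order(level)
    print(level, [round(c, 3) for c in coeffs], [round(n) for n in nulls])
```

With this normalization the coefficients sum to one (unity response at $\theta = 0$) and the magnitude of the response at $\theta = 180^\circ$ equals the chosen sidelobe level.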

Next, the directivity index and the front-to-back ratio for the equi-sidelobe
third-order array are examined as functions of sidelobe level. Figure 2.25 shows
these two quantities for equi-sidelobe levels from -10 dB to -60 dB. The
directivity index reaches its maximum of 12.0 dB at a sidelobe level of
approximately -16 dB. The -16 dB equi-sidelobe design, plotted in Fig. 2.26(a),
is therefore close to the optimal third-order differential hypercardioid of
Fig. 2.23(c). The front-to-back ratio reaches a maximum of 37.3 dB at a sidelobe
level of -42.5 dB; the response is plotted in Fig. 2.26(b). For sidelobe levels
less than -42.5 dB, the mainlobe moves into the rear half-plane. For sidelobe
levels greater than -42.5 dB, the zero locations move towards $\theta = 0^\circ$
and as a result the beamwidth decreases.

Figure 2.25 Directivity index and front-to-back ratio for equi-sidelobe third-order differential
array designs versus sidelobe level.

5.4 HIGHER-ORDER DESIGNS 

Due to sensitivity to electronic noise and microphone matching requirements,
differential array designs higher than third order are not practically realizable.
These arrays will probably never be implemented anywhere other than in
computer simulation. In fact, the design of higher-order supercardioid and
hypercardioid differential arrays using the techniques discussed in Sections 4.2
and 4.3 can become computationally difficult on present computers.
Table 2.6 Table of first-order, second-order, and third-order differential microphone array
designs for spherically isotropic acoustic fields.

Microphone type      DI (dB)   F (dB)   Beamwidth   Null(s) (degs)

First-order designs
dipole                 4.8       0.0       90°      90
cardioid               4.8       8.5      131°      180
hypercardioid          6.0       8.5      105°      109
supercardioid          5.7      11.4      115°      125

Second-order designs
dipole                 7.0       0.0       65°      90
cardioid               7.0      14.9       94°      180
hypercardioid          9.5       8.5       66°      73, 134
supercardioid          8.3                          104, 144
-15 dB sidelobe        9.4      10.7                78, 142
-30 dB sidelobe        8.1      18.5       84°      109, 152
min. rear peak         8.5      22.4                98, 149

Third-order designs
dipole                 8.5       0.0       54°      90
cardioid               8.5      21.0       78°      180
hypercardioid         12.0      11.2                55, 100, 145
supercardioid          9.9      37.7       66°      97, 122, 153
-20 dB sidelobe       11.8      14.8       52°      62, 102, 153
-30 dB sidelobe       10.8      25.2                79, 111, 156
Figure 2.26 Directivity responses for equi-sidelobe third-order differential arrays for (a)
maximum directivity index and (b) maximum front-to-back ratio.



6. SENSITIVITY TO MICROPHONE MISMATCH 
AND NOISE 

There is a significant amount of literature on the sensitivity of
superdirectional array design to interelement errors in position, amplitude, and
phase [7, 10, 5]. Since the array designs discussed in this chapter have
interelement spacings which are much less than the acoustic wavelength,
differential arrays are indeed superdirectional arrays. Early work on
superdirectional, or supergain, arrays involved over-steering a Dolph-Chebyshev
array past endfire. When the effective interelement spacing becomes much less
than the acoustic wavelength, the amplitude weighting of the elements oscillates
between plus and minus, resulting in pattern differencing or differential
operation. Curiously though, the papers in the field of superdirectional arrays
never point out that at small spacings the array can be designed as a
differential system as given by (2.25). A usual comment in the literature is that
the design of superdirectional arrays requires amplitude weighting that is highly
frequency dependent. For the application of the designs that we are discussing,
namely differential systems where the wavelength is much larger than the array
size, the amplitude weighting is constant with frequency as long as we do not
consider the necessary time delay as part of the weighting coefficient. The only
frequency correction necessary is the compensation of the output of the
microphone for the high-pass characteristic of the nth-order system.

One quantity which characterizes the sensitivity of the array to random
amplitude and position errors is the sensitivity function introduced by Gilbert and
Morgan [7]. The sensitivity function modified by adding a delay parameter $\tau$
is

\[
K = \frac{\sum_{m=1}^{n} b_m^2}{\left| \sum_{m=1}^{n} b_m e^{-jk(r_m + c\tau_m)} \right|^2},
\tag{2.124}
\]

where $r_m$ is the distance from the origin to microphone m, $b_m$ are the amplitude
shading coefficients of a linear array, and $\tau_m$ is the delay associated with
microphone m. For the differential microphones discussed, the sensitivity function
reduces to

\[
K = \frac{n + 1}{\left[ \prod_{m=1}^{n} 2\sin\left( k(d + c\tau_m)/2 \right) \right]^2},
\tag{2.125}
\]

where d is the microphone spacing. For values of $kd \ll 1$, (2.125) can be
further reduced to

\[
K \approx \frac{n + 1}{\left[ \prod_{m=1}^{n} k(d + c\tau_m) \right]^2}.
\tag{2.126}
\]

For the nth-order dipole case, (2.126) reduces to

\[
K_D \approx \frac{n + 1}{(kd)^{2n}}.
\tag{2.127}
\]



The ramifications of (2.127) are intuitively appealing. The equation merely
indicates that the sensitivity to noise and array errors for nth-order differential
arrays is inversely proportional to the frequency response of the array.
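To give a feel for the magnitudes implied by (2.127), the sketch below evaluates $K_D$ in decibels for first-, second-, and third-order dipoles. The spacing of 1 cm, the frequencies of 500 Hz and 250 Hz, and the speed of sound of 343 m/s are illustrative assumptions, and the helper name is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def dipole_sensitivity_db(order, spacing_m, freq_hz, c=SPEED_OF_SOUND):
    """Evaluate K_D = (n + 1) / (kd)^(2n) from (2.127), returned in dB.

    Only meaningful for kd << 1, where the small-argument form holds.
    """
    kd = 2.0 * math.pi * freq_hz / c * spacing_m
    k_d = (order + 1) / kd ** (2 * order)
    return 10.0 * math.log10(k_d)

for n in (1, 2, 3):
    at_500 = dipole_sensitivity_db(n, 0.01, 500.0)
    at_250 = dipole_sensitivity_db(n, 0.01, 250.0)
    # Halving the frequency raises K_D by about 6n dB, mirroring the
    # 6n dB/octave high-pass response of an nth-order differential array.
    print(n, round(at_500, 1), round(at_250 - at_500, 2))
```

The absolute values depend on the assumed constants and are not meant to reproduce the curves of Fig. 2.27 exactly; the point of the sketch is the growth of $K_D$ with order and with decreasing frequency.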

The response of an array to perturbations of amplitude, phase, and position
can be expressed as a function of a common error term $\delta^2$. The validity of
combining these terms into one quantity hinges on the assumption that these
errors are small compared to the desired values. The reader is referred to the
article by Gilbert and Morgan for specific details [7].

The error perturbation power response pattern is dependent on the error term
$\delta^2$, the actual desired beam pattern, and the sensitivity factor K. The
response is given by

\[ \hat{E}_N^2(\theta) = E_N^2(\theta) + K\delta^2. \tag{2.128} \]

Typically $\delta$ is very small, and can be controlled by careful design. However,
even with careful control, we can only hope for 1% tolerances in amplitude and
position. Therefore, even under the best of circumstances there will be great
difficulty in realizing a differential array if the value of K approaches or exceeds
10,000 (40 dB). A plot of the value of K for various first-order microphones as a
function of the dimensionless parameter kd is shown in Figure 2.27(a). We note
here that, of the microphone designs shown, the hypercardioid design has a lower
K factor than the dipole. This is in apparent contradiction to other
superdirectional array designs that can be found in the literature [5]. Typically,
higher directional gain results in a higher value of K. The reason for the
apparent contradiction in Fig. 2.27(a) is that the overall gain of the hypercardioid
is higher than that of the dipole microphone shown, since the delay increases the
array output. Other first-order designs with lower values of K are possible, but
these do not exhibit the desired optimum directional patterns. Figure 2.27(b) shows
the sensitivity function for first, second, and third-order dipoles. It is obvious
that a higher order differential array has a larger value of K. Also, as the phase
delay (d/c) between the elements is increased, the upper frequency limit of the
usable bandwidth is reduced.

Figure 2.27 Sensitivity as a function of wavelength element-spacing product for (a) various
first-order differential microphones and (b) first, second, and third-order dipoles.

Another problem that is directly related to the sensitivity factor K is the sus- 
ceptibility of the differential system to microphone and electronic preamplifier 
noise. If the noise is approximately independent between microphones, the 
SNR loss will be proportional to the sensitivity factor K. At low frequencies 
and small spacings, the signal to noise ratio can easily become less than 0 dB 
and that is of course not a very useful design. 

As an example, consider the case of a first-order dipole array with an
effective dipole spacing of 1 cm. Assume that the self-noise of the microphone is
equal to an equivalent sound pressure level of 35 dB re 20 µPa, which is typical
for available first-order differential microphones. Now, we place this first-order
differential dipole at 1 meter from a source that generates 65 dB at 500 Hz at
1 meter (typical of speech levels). The resulting first-order differential
microphone output SNR, from Fig. 2.27(b), is only 9 dB. For a second-order array
with equivalent spacing the SNR would be -12 dB, and for a third-order array,
-33 dB. Although this example makes differential arrays of order higher than first
look hopeless, there are design solutions that can improve the situation.

In the design of second-order arrays, West, Sessler, and Kubli [15] used
baffles around first-order dipoles to increase the second-order differential
signal-to-noise ratio. The diffraction caused by the baffle effectively increases
the dipole distance d by a factor that is proportional to the baffle radius. The
diffraction is angle and frequency dependent and, if used properly, can be
exploited to offer superior performance to an equivalent dipole composed of two
zero-order (omnidirectional) microphones. The use of the baffles discussed in
reference [15] resulted in an effective increase in the SNR of approximately 10 dB.
The benefit of the baffles used by West, Sessler, and Kubli becomes clear
by examining (2.127) and noting the aforementioned increase in the effective
dipole distance.

Another possible technique to both improve the signal-to-noise ratio and
reduce the sensitivity to microphone amplitude and phase mismatch is to split the
design into multiple arrays, each covering a specific frequency range. In the
design of a differential array, the spacing must be kept small compared to the
acoustic wavelength. Since the acoustic wavelength is inversely proportional
to the frequency, the desired upper frequency cutoff for an array sets the array
microphone spacing requirements. If we divide the differential array into
frequency subbands, then the ratio of upper frequency to lower frequency cutoff
can be reduced and the spacing for each subband can be made much larger than
the spacing for a full-band differential array. The increase in signal-to-noise
ratio is proportional to the relative increase in spacing allowed by the use of the
subband approach. If the desired frequency range is equally divided into M
subbands, the lowest subband SNR increase will be proportional to
$20\log_{10}(M)$. The increase in SNR for each increasing frequency subband will
diminish until the highest subband, which will have the same SNR as the full-band
system. The subband solution does have some cost: the number of array elements
must also increase. The increase is at least M and, by reuse of the array elements
in the subband arrays, can be controlled to be less than nM, where n is the
differential array order.

Finally, another approach to control microphone self-noise would be to
construct many differential arrays that are very close in position to each other. By
combining the outputs of many arrays with uncorrelated self-noise, the SNR can
be effectively enhanced by $10\log_{10}(M)$, where M is the number of individual
differential arrays.
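The two scaling rules quoted above are easy to tabulate. The sketch below simply evaluates the $20\log_{10}(M)$ term for the lowest of M equal subbands and the $10\log_{10}(M)$ term for M combined arrays. It assumes ideal conditions, namely that the spacing can be scaled exactly by the bandwidth ratio and that the self-noise is perfectly uncorrelated between arrays; the function names are illustrative.

```python
import math

def subband_snr_term_db(num_subbands):
    """The 20*log10(M) term quoted for the lowest of M equal subbands.

    The lowest subband ends at 1/M of the full-band upper frequency, so its
    element spacing can be made M times larger than in the full-band design.
    """
    return 20.0 * math.log10(num_subbands)

def combining_snr_gain_db(num_arrays):
    """SNR gain from summing M closely spaced arrays with uncorrelated
    self-noise: signal power grows as M^2 while noise power grows as M."""
    return 10.0 * math.log10(num_arrays)

for m in (2, 4, 8):
    print(m, round(subband_snr_term_db(m), 1), round(combining_snr_gain_db(m), 1))
```

Doubling the number of subbands adds about 6 dB to the first term, whereas doubling the number of combined arrays adds only about 3 dB, which is one way to compare the hardware cost of the two approaches.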
7. CONCLUSIONS

Information regarding the design and analytical development of optimal
differential microphones cannot be found in the literature. The purpose of this
chapter was to provide a basis for the design of differential microphone array
systems. Systems with differential orders greater than three require
microphones and calibration with tolerances and noise levels not yet available with
current electroacoustic transducers. Higher order systems are also somewhat
impractical in that the relative gain in directivity is small [$O(\log_{10}(n))$] as the
order n of the microphone increases. Differential microphone array designs
are primarily limited by the sensitivity to microphone mismatch and self-noise.
The results for many of the differential microphone array designs discussed in
this chapter are summarized in Table 2.6.
Abstract With the recent widespread availability of inexpensive DVD players and home
theater systems, surround sound has become a mainstream consumer technol- 
ogy. The basic recording techniques for live sound events have not changed 
to accommodate this new dimension of sound field playback. More advanced 
analysis of sound fields and forensic capture of spatial sound also require new 
microphone array systems. This chapter describes a new spherical microphone 
array that performs an orthonormal decomposition of the sound pressure field. 
Sufficient order decomposition into these eigenbeams can produce much higher 
spatial resolution than traditional recording systems, thereby enabling more ac- 
curate sound field capture. A general mathematical framework based on these 
eigenbeams forms the basis of a scalable representation that enables one to eas- 
ily compute and analyze the spatial distribution of live or recorded sound fields. 
A 24 element spherical microphone array composed of pressure microphones 
mounted on the surface of a rigid spherical baffle was constructed. Experimental 
results from a real-time implementation show that a theory based on spherical 
harmonic eigenbeams matches measured results. 



Keywords: Microphone Array, Beamforming, Spherical, 3D Sound Recording 



1. INTRODUCTION 

A microphone array typically consists of two units: an arrangement of two
or more microphones and a beamformer that linearly combines the microphone
signals. This combination allows picking up sound signals dependent on their
direction of propagation. The advantage of microphone arrays over conventional
directional microphones, like a shotgun microphone, is their high flexibility due
to the degrees of freedom offered by the multitude of elements and the associated
beamformer. The directional pattern of a microphone array can be varied over a
wide range. This can be done by changing the beamformer, which typically is
implemented in software. Therefore no mechanical alteration of the system is needed.

There are several standard array geometries of which the most common is 
a linear array. An advantage of using a uniformly spaced linear array is its 
simplicity with respect to analysis which is equivalent to FIR filter design. This 
chapter describes a spherical array geometry which has several advantages over 
other geometries: the beampattern can be steered to any direction in 3-D space 
without changing the shape of the pattern and the spherical array allows full 
3-D control of the beampattern. 

An early publication by DuHamel [1] in 1952 described a design approach 
using a spherical harmonic expansion for narrow-band spherical beamforming 
antenna arrays. Later work on spherical antenna arrays can be found in e.g. 
[2, 3, 4]. These papers apply standard beamforming techniques, known for 
example from linear arrays. Perhaps the first quasi-spherical microphone array 
was patented by Craven and Gerzon [5] which consisted of four directional 
microphones placed on a virtual sphere at the corners of a tetrahedron. This 
array allows the recording of zero and first-order spherical harmonics and has 
been used for Ambisonics recordings [6]. More recently there has been a 
growing interest in spherical microphone arrays [7, 8, 9, 10, 11]. 

This chapter presents a general beamformer design approach based on spher- 
ical harmonics. The array consists of pressure sensors that are located on the 
surface of a rigid sphere. It is shown that this arrangement brings several ad- 
vantages: a) the scattering effects of the sphere are rigorously calculable, b) the 
diffraction of the sphere brings a signal-to-noise (SNR) improvement at low 
frequencies and c) due to the diffraction and scattering introduced by the rigid 
sphere, the array is able to pick up all spherical harmonic modes over a wide 
frequency range. 

The goal is to design an array that is capable of recording spherical harmonics 
of third-order or higher. Incorporating higher-order spherical harmonic modes
significantly increases the spatial resolution and the degrees of freedom for
beampattern design. The proposed beamformer consists of two main units: the 
eigenbeamformer, which performs a spatial decomposition of the sound-field 
into spherical harmonics, and the modal-beamformer, which forms the output 
beam by appropriately combining the spherical harmonics. This segmenta- 
tion allows a decoupling of the actual beamformer from the sensor locations. 
It is shown that this structure results in an efficient and elegant beamformer 
architecture. 
Figure 3.1 Definition of the spherical coordinate system.



Spherical array beamforming can be applied to a wide range of applications
such as directional sound pick-up, sound field analysis and reconstruction,
passive acoustic tracking, forensic beamforming, or room acoustic measurements.

2. FUNDAMENTAL CONCEPT 

The main design goal is to record the temporal and spatial information of a 
sound-field at the position of the array. According to the Helmholtz equation 
[12] a sound field is uniquely determined if the sound pressure and particle 
velocity are known on a closed surface. Knowledge of these quantities allows 
one to calculate the sound-field inside and outside of this surface as long as 
no sources or obstacles are in the reconstructed volume. Therefore it follows 
that one needs to measure the sound pressure and the particle velocity only on 
a closed surface in order to record all information of the surrounding
sound-field. To greatly simplify the array construction, only acoustic
pressure-sensing microphones integrated into the surface of a rigid sphere are used. Using
a rigid body has the advantage that the radial particle velocity on its surface is 
zero. This greatly simplifies the general solution since only the sound pressure 
needs to be measured. The spherical shape was chosen to keep the mathematics 
as simple as possible and, from symmetry, put equal weight on all directions. 
Theoretically many other shapes are possible. 

To keep this chapter self-contained, a brief review of the physical principle 
of the spherical array is presented, that is based on a rigid spherical scatterer. 
For more detailed analysis, the reader is referred to e.g. [13]. Figure 3.1 shows 
the notation for the spherical coordinate system used throughout this chapter. 
A plane-wave impinging at an angle $\vartheta$ relative to the z-direction can be
expressed in spherical coordinates as follows [13]:

\[
G(kr, \vartheta, t) = e^{i(\omega t + kr\cos\vartheta)}
= \sum_{n=0}^{\infty} (2n+1)\, i^n j_n(kr)\, P_n(\cos\vartheta)\, e^{i\omega t}.
\tag{3.1}
\]

To keep the exposition compact, the results given in this chapter are restricted
to plane-wave incidence. If desired, results can be extended to spherical wave
incidence [13]. In (3.1), $j_n$ represents the spherical Bessel function of the first
kind of order n. The letter i is used for the imaginary constant to avoid confusion
with the spherical Bessel function. $P_n^m$ is the Legendre function of order n and
degree m (with $P_n = P_n^0$), and k represents the wave-number, which is equal to
$2\pi/\lambda$ where $\lambda$ is wavelength.
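The expansion (3.1) can be verified numerically by comparing a truncated sum with the closed-form exponential, omitting the common factor $e^{i\omega t}$. The following stdlib-only sketch computes $j_n$ from its power series and $P_n$ from the three-term recurrence; the helper names, the test point ($kr = 2$, $\vartheta = 40^\circ$), and the truncation orders are arbitrary choices made for illustration.

```python
import cmath
import math

def spherical_jn(n, x, terms=40):
    """Spherical Bessel function of the first kind via its power series:
    j_n(x) = x^n * sum_k (-x^2/2)^k / (k! * (2n + 2k + 1)!!)."""
    term = x ** n
    for odd in range(1, 2 * n + 2, 2):   # divide by (2n + 1)!!
        term /= odd
    total = 0.0
    for k in range(terms):
        total += term
        # Ratio between consecutive series terms.
        term *= -(x * x / 2.0) / ((k + 1) * (2 * n + 2 * k + 3))
    return total

def legendre(n, u):
    """Legendre polynomial P_n(u) by the three-term recurrence."""
    p_prev, p = 1.0, u
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * u * p - k * p_prev) / (k + 1)
    return p

def plane_wave_series(kr, theta, n_max):
    """Partial sum of (3.1) without the time factor exp(i*omega*t)."""
    u = math.cos(theta)
    return sum((2 * n + 1) * (1j ** n) * spherical_jn(n, kr) * legendre(n, u)
               for n in range(n_max + 1))

kr, theta = 2.0, math.radians(40.0)
exact = cmath.exp(1j * kr * math.cos(theta))
error_high = abs(exact - plane_wave_series(kr, theta, n_max=15))
error_low = abs(exact - plane_wave_series(kr, theta, n_max=1))
print(error_high, error_low)
```

The residual shrinks rapidly once the truncation order exceeds kr, which anticipates the later observation that only a limited number of modes carry significant energy at a given frequency.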

From (3.1), the sound particle velocity for an impinging plane-wave on a 
spherical surface can be derived using Euler’s equation. If the spherical surface 
is rigid, the sum of the radial velocity of the incoming and the scattered sound 
wave has to be zero on this surface. Using this boundary condition, the re- 
flected sound pressure can be determined and the resulting sound pressure field 
becomes the superposition of the incoming and the scattered sound pressure 
field: 



\[
G(kr, ka, \vartheta) = \sum_{n=0}^{\infty} (2n+1)\, i^n
\left( j_n(kr) - \frac{j_n'(ka)}{h_n^{(2)\prime}(ka)}\, h_n^{(2)}(kr) \right) P_n(\cos\vartheta).
\tag{3.2}
\]

The prime denotes the derivative with respect to the argument and a is the radius
of the sphere, while $h_n^{(2)}$ is the spherical Hankel function of the second kind of
order n. From this point, the time dependence is omitted for better readability 
of the equations. 

To find a more general expression that gives the sound pressure at a point
$[r_s, \vartheta_s, \varphi_s]$ for an impinging sound wave from direction
$[\vartheta, \varphi]$, the following Legendre addition theorem is essential [14]:

\[
P_n(\cos\Theta) = \sum_{m=-n}^{n} \frac{(n-m)!}{(n+m)!}\,
P_n^m(\cos\vartheta)\, P_n^m(\cos\vartheta_s)\, e^{im(\varphi - \varphi_s)},
\tag{3.3}
\]

where $\Theta$ is the angle between the direction of the impinging sound wave
and the radius vector of the observation point $[r_s, \vartheta_s, \varphi_s]$.
Substituting (3.3) into (3.2) gives the following solution:

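\[
G(kr_s, \vartheta_s, \varphi_s, ka, \vartheta, \varphi) =
4\pi \sum_{n=0}^{\infty} i^n b_n(ka, kr_s) \sum_{m=-n}^{n}
Y_n^m(\vartheta, \varphi)\, Y_n^{m*}(\vartheta_s, \varphi_s),
\tag{3.4}
\]

where the asterisk $(\cdot)^*$ denotes the complex conjugate operation. In (3.4), two
abbreviations, $b_n$ and $Y_n^m$, are introduced:

\[
b_n(ka, kr_s) = j_n(kr_s) - \frac{j_n'(ka)}{h_n^{(2)\prime}(ka)}\, h_n^{(2)}(kr_s),
\tag{3.5}
\]

\[
Y_n^m(\vartheta, \varphi) = \sqrt{\frac{2n+1}{4\pi}\, \frac{(n-m)!}{(n+m)!}}\;
P_n^m(\cos\vartheta)\, e^{im\varphi}.
\tag{3.6}
\]

These quantities play a major role in the concept of the spherical array. In the
following, $b_n$ will be referred to as modal strength or modal coefficient. $Y_n^m$
represents the spherical harmonic of order n and degree m. One important
property is their mutual orthonormality:

\[
\int_0^{2\pi} \int_0^{\pi} Y_n^m(\vartheta, \varphi)\, Y_{n'}^{m'*}(\vartheta, \varphi)\,
\sin\vartheta \, d\vartheta \, d\varphi = \delta_{nn'}\, \delta_{mm'}.
\tag{3.7}
\]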

Jo Jo 

This property makes spherical harmonics very attractive to beamforming. Ba- 
sically any square integrable function defined on the surface of a sphere, e.g. a 
beampattern, can be expanded into a spherical harmonic series. In fact, some 
beamforming work based on spherical harmonics has been done in the past 
[15]. Equation (3.4) represents a key result of this analysis. It allows the repre- 
sentation of a plane-wave sound-field as an expansion of spatially orthonormal 
spherical harmonic components. Since a far-field sound field can be modeled 
as a superposition of multiple plane-waves, (3.4) allows an expansion of any 
far-field sound-field into an orthonormal series of spherical harmonics. In the 
following we will refer to these orthonormal components as modes of order n 
and degree m. 

Based on (3.4), the next section describes the eigenbeamformer which de- 
composes the sound-field into the orthonormal modes while Section 4 describes 
the modal beamformer which combines these modes to realize desired beam- 
patterns. 

3. THE EIGENBEAMFORMER 

As shown in Figure 3.2, the beamformer can be logically divided into
two separate cascaded sections. The first step of this two-stage beamforming
process can be viewed as a preprocessor to the actual forming of the output
beam, which is done in the modal-beamformer second stage. As will be shown
in Section 4, introducing this preprocessor step results in many advantages for
the beamforming itself.

Figure 3.2 Block diagram of the functional blocks of the spherical array.

The task of this preprocessor is to transform the microphone signals into an 
orthogonal beam-space. Since these beams are characteristics of the sound- 
field in a similar way as eigenvectors are for a matrix, they are referred to as 
eigenbeams. Hence, the preprocessor is referred to as an eigenbeamformer. 

The decomposition of the sound-field is based on the orthonormal property
of the spherical harmonics. To begin, assume a rigid sphere covered with
a continuous pressure sensitive surface, and let the sensitivity of this surface be
position-dependent and described by the spherical harmonic
$Y_{n'}^{m'}(\vartheta_s, \varphi_s)$. The output of such a microphone is:

\[
F_{n',m'}(\vartheta, \varphi, ka, kr_s)
= \int_{\Omega_s} G(kr_s, \vartheta_s, \varphi_s, ka, \vartheta, \varphi)\,
Y_{n'}^{m'}(\vartheta_s, \varphi_s)\, d\Omega_s
= 4\pi i^{n'} b_{n'}(ka, kr_s)\, Y_{n'}^{m'}(\vartheta, \varphi),
\tag{3.8}
\]

where $d\Omega_s$ represents an integration over the surface of the sphere. This result
states that the far-field directivity of the microphone has the same directional
dependence as its surface sensitivity function, namely $Y_{n'}^{m'}$. The factor
$4\pi i^{n'}$ introduces a phase shift and scaling that can be easily compensated and
is neglected in the further discussions. Another factor is the modal coefficient
$b_{n'}$. It introduces a frequency dependence that must be taken into account and is
further investigated in Section 3.3. Yet another problem to be solved is the change
from the continuous microphone aperture as it is used in (3.8) to a sampled
aperture for the spherical array. This step is necessary since a) a position
dependent continuous sensitivity would be extremely difficult to manufacture
and b) a separate sphere for every eigenbeam would be required. The step from
a continuous to a discrete aperture is described in the next section.
3.1 DISCRETE ORTHONORMALITY

To make a practical array, the continuous aperture needs to be sampled. To 
achieve the same result as obtained in (3.8) the sample positions have to fulfill 
the following discrete orthonormality condition: 

\[
A_{nm} \sum_{s=0}^{S-1} Y_n^m(\vartheta_s, \varphi_s)\, Y_{n'}^{m'*}(\vartheta_s, \varphi_s)
= \delta_{nn'}\, \delta_{mm'}.
\tag{3.9}
\]

Note that the subscript s was previously used to identify a point on the spherical 
surface while in (3.9) it enumerates the sensors that are located on the surface. 
It is a difficult task to find a set of sensor locations that fulfill the orthonormality. 
To relax the constraint from orthonormality to orthogonality, a factor Anm is 
introduced. This factor can be further reduced by defining: 

\[ A_n = \frac{4\pi}{S}. \tag{3.10} \]

The factor $4\pi$ becomes necessary since there is no integration over a sphere for
the discrete case where it becomes a sum over S sensors which are normalized 
by the factor 1/S. Instead of including this re-normalization in an additional 
factor An, this new normalization can be included in the spherical harmonics 
immediately. To avoid unnecessary confusion this approach is not pursued here. 

One sensor arrangement that fulfills the constraint of discrete orthonormality 
up to modes of 4th-order is the center of the 32 faces of a truncated icosahe- 
dron. Another arrangement using only 24 sensors was found, which achieves 
orthogonality up to 3rd-order modes (see Section 9). 

The resulting structure of the eigenbeamformer can be derived from (3.9):
for a specific eigenbeam $Y_n^m$, the microphone signals are first weighted by
the sampled values of the corresponding surface sensitivity,
$Y_n^m(\vartheta_s, \varphi_s)$, then corrected by $A_{nm}$, and finally summed. The
outputs of this beamformer are the eigenbeams.

Besides the orthonormal constraint, spatial aliasing has to be considered 
when sampling a discrete aperture. Just as sampling a time waveform requires 
a minimum number of samples per time interval in order to be able to recover 
the original signal, sampling the spatial aperture requires a minimum number of 
sample locations to recover the original spatial signal distribution. Since there
are $(N+1)^2$ spherical harmonics for a spatial resolution of order N (see Section
3.2), a minimum of $(N+1)^2$ sample locations are required to distinguish the
spherical harmonics.

3.2 THE EIGENBEAMS 

The outputs of the eigenbeamformer are a set of orthonormal beam-patterns,
the eigenbeams. These eigenbeams represent a spatially orthonormal
decomposition of the sound-field, as shown in (3.4). A complete set of eigenbeams
contains all spatial information about the original sound-field.

Figure 3.3 Eigenbeams of order 0, 1, and 2 (degree 0).

Figure 3.3 shows some example eigenbeams. From (3.6) it can be seen
that the elevation dependence follows the Legendre function while the azimuth
dependence has a sine-cosine dependence. The order n determines the number
of zeros in the $\vartheta$-direction while two times the degree m gives the number of
zeros in the $\varphi$-direction.

The number of eigenbeams depends on the desired spatial resolution for the
application. The total number of eigenbeams up to N-th order is $(N+1)^2$.
For example, in a directional microphone application the maximum achievable
directional gain is $20\log_{10}(N+1)$. To obtain a maximum directional gain of 12
dB for an arbitrary direction, one would need all eigenbeams up to third-order,
which is 16 eigenbeams. As mentioned before, the number of microphones
needs to be equal to or larger than the number of eigenbeams. The example
assumes full 3D coverage. If one is interested only in the horizontal spatial
resolution, the number of eigenbeams can be reduced significantly to 2N + 1.
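The counting argument and the $20\log_{10}(N+1)$ limit can be checked numerically. The sketch below assumes that the maximum-directivity axisymmetric pattern of order N is proportional to $\sum_{n=0}^{N}(2n+1)P_n(\cos\vartheta)$, a standard result that is not derived in this section, and integrates its power over the sphere with a simple midpoint rule. All function names are illustrative.

```python
import math

def legendre(n, u):
    """Legendre polynomial P_n(u) by the three-term recurrence."""
    p_prev, p = 1.0, u
    if n == 0:
        return p_prev
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * u * p - k * p_prev) / (k + 1)
    return p

def num_eigenbeams(order, horizontal_only=False):
    """(N + 1)^2 eigenbeams for full 3-D coverage, 2N + 1 for the horizontal case."""
    return 2 * order + 1 if horizontal_only else (order + 1) ** 2

def max_directivity_index_db(order, steps=2000):
    """Directivity index of the pattern sum_n (2n + 1) P_n(u), u = cos(theta)."""
    def pattern(u):
        return sum((2 * n + 1) * legendre(n, u) for n in range(order + 1))
    du = 2.0 / steps
    # Midpoint-rule integral of the squared pattern over u in [-1, 1].
    energy = sum(pattern(-1.0 + (i + 0.5) * du) ** 2 for i in range(steps)) * du
    return 10.0 * math.log10(2.0 * pattern(1.0) ** 2 / energy)

for order in range(1, 5):
    print(order, num_eigenbeams(order),
          round(max_directivity_index_db(order), 2),
          round(20.0 * math.log10(order + 1), 2))
```

For N = 3 the numerically integrated directivity index agrees with $20\log_{10}(4) \approx 12$ dB, using the 16 eigenbeams mentioned in the text.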

The beampattern of the eigenbeams is frequency independent. However,
the magnitude shows a dependence according to the modal coefficient. This
dependence will be analyzed in the next section.

3.3 THE MODAL COEFFICIENTS 

From (3.8), it is seen that the eigenbeams exhibit a frequency dependence
according to the modal coefficient $b_n$. The magnitude of the frequency response
for these coefficients is plotted in Fig. 3.4 for various orders n.

In Fig. 3.4, it can be seen that at very low frequencies the zeroth-order
mode is dominant. For ka = 0.2 (for a sphere of radius 5 cm, this would result
in a frequency of 220 Hz), the first-order mode is down by 20 dB. At higher
frequencies more modes emerge. The rising slope of the modal coefficients is
6n dB per octave. Once a mode has reached an adequate level, it can be used
for further processing. The level depends on the desired SNR of the overall
system.
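The low-frequency behaviour described above can be reproduced from (3.5) evaluated on the surface of the sphere ($r_s = a$). The sketch below is a stdlib-only approximation: it computes $j_n$ from its power series, $y_n$ by upward recurrence, and takes $h_n^{(2)} = j_n - i\,y_n$. The helper names and the speed of sound of 343 m/s are assumptions.

```python
import math

def spherical_jn(n, x, terms=40):
    """Spherical Bessel function of the first kind via its power series."""
    term = x ** n
    for odd in range(1, 2 * n + 2, 2):   # divide by (2n + 1)!!
        term /= odd
    total = 0.0
    for k in range(terms):
        total += term
        term *= -(x * x / 2.0) / ((k + 1) * (2 * n + 2 * k + 3))
    return total

def spherical_yn(n, x):
    """Spherical Bessel function of the second kind by upward recurrence."""
    y_prev = -math.cos(x) / x                       # y_0
    if n == 0:
        return y_prev
    y = -math.cos(x) / x ** 2 - math.sin(x) / x     # y_1
    for k in range(1, n):
        y_prev, y = y, (2 * k + 1) / x * y - y_prev
    return y

def derivative(f, n, x):
    """f_n'(x) = f_{n-1}(x) - (n + 1)/x * f_n(x); for n = 0, f_0' = -f_1."""
    if n == 0:
        return -f(1, x)
    return f(n - 1, x) - (n + 1) / x * f(n, x)

def modal_coefficient(n, ka):
    """b_n(ka, ka) from (3.5) with the sensor on the sphere surface."""
    jn, yn = spherical_jn(n, ka), spherical_yn(n, ka)
    djn, dyn = derivative(spherical_jn, n, ka), derivative(spherical_yn, n, ka)
    hn = complex(jn, -yn)      # spherical Hankel function of the second kind
    dhn = complex(djn, -dyn)   # and its derivative
    return jn - djn / dhn * hn

def level_db(n, ka):
    return 20.0 * math.log10(abs(modal_coefficient(n, ka)))

first_vs_zeroth_db = level_db(1, 0.2) - level_db(0, 0.2)
freq_hz = 0.2 * 343.0 / (2.0 * math.pi * 0.05)   # ka = 0.2 for a = 5 cm
slopes_db_per_octave = [level_db(n, 0.1) - level_db(n, 0.05) for n in (1, 2, 3)]
print(round(first_vs_zeroth_db, 1), round(freq_hz), [round(s, 1) for s in slopes_db_per_octave])
```

The slope estimate is taken between ka = 0.05 and ka = 0.1, well inside the low-frequency region where each mode rises at roughly 6n dB per octave.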
Figure 3.4 Mode coefficients $b_n$ for different orders.



To allow subsequent combination of the eigenbeams they should have a flat 
frequency response. This means that the output of the eigenbeamformer should 
be filtered with the inverse frequency response of the modal coefficients. For low 
frequencies this is basically an amplification. Depending on the performance of 
the microphones and associated hardware and the desired SNR, the maximum 
gain needs to be limited to a certain threshold. The result is that each eigenbeam 
has a low frequency cutoff below which the eigenbeam should not be used for 
further processing. 

The sound field around the sphere contains modes of higher orders than the 
array can sample. For example, at 5 kHz the sound-field around a sphere of 
5 cm radius (ka = 4.6) contains spherical harmonics of significant strength up 
to fifth-order. A spherical array with 32 microphones is able to handle spherical 
harmonics up to fourth-order. To enable a wideband response while avoiding 
aliasing, one has to provide a spatial low-pass filter. Such a low-pass filter can be 
implemented using microphones with large membranes or patch-microphones. 
The term patch-microphone refers to microphones that cover a continuous sec- 
tion of the spherical surface as opposed to point microphones. By integrating 
the sound over a large area, higher order modes will be attenuated. Such a 
patch microphone might be built by using pressure sensitive materials that can 
be placed conformingly onto the surface of a sphere or made by combining 
many closely-spaced pressure microphones.

Figure 3.5 Illustration of generating a second-order hypercardioid pattern. (Note that only the
directional properties are shown. The magnitude is scaled to unity for the look direction.)

4. MODAL-BEAMFORMER 

The modal beamformer forms the second stage of the overall array processing 
structure. The name modal beamformer was chosen to emphasize its difference
from a conventional beamformer: typically the input signals to a beamformer are
the microphone signals. However, the modal beamformer takes spatial modes,
the eigenbeams, as input signals. This design approach allows a very simple
and powerful design that consists of the actual beam shaping unit (combining 
unit) and the steering unit. Both units are independent of each other allowing a 
change of the beam-pattern while maintaining the look-direction or maintaining 
the beam-pattern while steering the look-direction. 

4.1 COMBINING UNIT 

The combining unit is a simple weight-and-sum beamformer:

$$d(\vartheta, \varphi) = \sum_{n=0}^{N} c_n Y_n(\vartheta, \varphi), \tag{3.11}$$

where the beamformer multiplies each input beam by a factor $c_n$ and sums all
weighted beams. As an example, Fig. 3.5 shows the generation of a second-
order hypercardioid pattern steered along the z-axis. It can be shown that the
weights for a hypercardioid pattern are: $c_{00} = 1$, $c_{10} = \sqrt{3}$, and $c_{20} = \sqrt{5}$.
Beams with degree not equal to zero are weighted with zero.
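The combining step for zero-degree eigenbeams can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the zero-degree harmonics take the form $\sqrt{(2n+1)/(4\pi)}\,P_n(\cos\vartheta)$ and that the hypercardioid weights are proportional to $\sqrt{2n+1}$. The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def zero_degree_pattern(c, theta):
    """Evaluate d(theta) = sum_n c_n * sqrt((2n+1)/(4*pi)) * P_n(cos(theta))."""
    c = np.asarray(c, dtype=float)
    n = np.arange(len(c))
    # Coefficients of the equivalent Legendre series in cos(theta).
    leg_coeffs = c * np.sqrt((2 * n + 1) / (4 * np.pi))
    return legendre.legval(np.cos(theta), leg_coeffs)


def directivity_index_db(c, num_nodes=64):
    """DI of an axisymmetric pattern, using Gauss-Legendre quadrature in cos(theta)."""
    x, wts = legendre.leggauss(num_nodes)
    d = zero_degree_pattern(c, np.arccos(x))
    d_look = zero_degree_pattern(c, 0.0)
    # Mean of |d|^2 over the sphere equals 0.5 * integral over x = cos(theta).
    mean_power = 0.5 * np.sum(wts * d ** 2)
    return 10 * np.log10(d_look ** 2 / mean_power)


# Assumed second-order hypercardioid weights, proportional to sqrt(2n+1).
c_hyper = [1.0, np.sqrt(3.0), np.sqrt(5.0)]
theta = np.linspace(0.0, np.pi, 181)
# Normalize to unity in the look direction, as in Fig. 3.5.
pattern = zero_degree_pattern(c_hyper, theta) / zero_degree_pattern(c_hyper, 0.0)
di_db = directivity_index_db(c_hyper)
```

Under these assumptions the computed directivity index agrees with the $20 \log_{10}(N + 1)$ limit for N = 2.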

Using only zero-degree eigenbeams limits the beampattern in this design
stage to be $\varphi$-independent. However, it is shown in the next section that this
greatly simplifies the steering of the pattern.

4.2 STEERING UNIT 

From Section 4.1, the beampattern d is obtained according to (3.11). Using
(3.6), this can be rewritten as:

$$d(\vartheta) = \sum_{n=0}^{N} c_n \sqrt{\frac{2n+1}{4\pi}}\, P_n(\cos\vartheta). \tag{3.12}$$

To steer a beampattern towards a desired look-direction $[\vartheta_0, \varphi_0]$, (3.3) can be
substituted into (3.12), which results in the new coefficients $\tilde{c}_{nm}$:

$$\tilde{c}_{nm} = c_n \sqrt{\frac{4\pi}{2n+1}}\, Y_n^{m*}(\vartheta_0, \varphi_0). \tag{3.13}$$

Equation (3.13) shows that the steering and beam-shaping coefficients are
connected in a multiplicative manner. To simplify the overall structure of the
modal-beamformer, one fact that can be exploited is that the steering related
terms are applied to all eigenbeams of a given order, while the weighting
coefficient $c_n$ only needs to be applied on a per order basis. To separate the steering
and the beam-shaping, we can rewrite (3.11) using (3.13):

$$d(\vartheta, \varphi) = \sum_{n=0}^{N} \underbrace{c_n}_{\text{combine}} \sum_{m=-n}^{n} \underbrace{\sqrt{\frac{4\pi}{2n+1}}\, Y_n^{m*}(\vartheta_0, \varphi_0)}_{\text{steer}}\, Y_n^{m}(\vartheta, \varphi). \tag{3.14}$$

5. ROBUSTNESS MEASURE 



An important characteristic of a microphone array is its sensitivity to devia- 
tions from the ideal implementation. These deviations include: a) errors in the 
sensor locations, b) variations in amplitude and phase, and c) sensor self noise.
A common measure for these non-ideal limitations is the noise sensitivity [16] 
or its inverse, the white-noise-gain (WNG). In this chapter the WNG will be 
used. The WNG is a measure of the robustness of the array, meaning that the 
higher the WNG of an array, the more robust it is against the above mentioned 
errors occurring in practical implementations. It is defined as: 



$$\mathrm{WNG}(\omega) = \frac{|d(\vartheta_0, \varphi_0, \omega)|^2}{\sum_{s=0}^{S-1} |H_s(\omega)|^2}, \tag{3.15}$$

where d represents the array output for the look-direction $[\vartheta_0, \varphi_0]$ and $H_s$ is the
array filter for sensor s. The numerator can be interpreted as the signal energy
at the output of the array, while the denominator is the sensor self noise power.
The sensor noise is assumed to be independent from sensor to sensor. It may not
be immediately obvious that this measure also quantifies the sensitivity of the
array to errors in the setup. For more details see reference [16].

The goal of this section is to find some general approximations for the WNG 
that allow an estimate of the robustness of the spherical array. To simplify the 
notation without loss of generality, the look-direction is assumed to be in the 
z-direction. 

To find an expression for the spherical array in the numerator of (3.15), the
array output d can be replaced by (3.11). The array filters $H_s$ in the denominator
of (3.15) can be replaced by the following expression:



N 



Hs{w) - 47T^ o ^n(‘&s,(Pa)- 



to^^bniu,) 



(3.16) 



Substituting d and $H_s$ into (3.15), the WNG for the spherical array becomes:



WNG(w) = 



N 



^ (AiYn('&0t V’o) 



S-1 

E 

s=0 



n=0 
N 



E 

n=0 



Cn{(jj)(Xji 






(3.17) 



From (3.17) it can be seen that the WNG depends to a large extent on the 
sampling scheme of the sphere. This makes a general prediction difficult. 
However, the following two special cases are investigated: a) the WNG of an 
individual eigenbeam and b) a super-directional pattern with $b_n \ll b_{n-1}$.

For a single eigenbeam the WNG from (3.17) becomes: 



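$$\mathrm{WNG}_n(\omega) = \frac{S^2\, |b_n(\omega)|^2\, |Y_n(\vartheta_0, \varphi_0)|^2}{\sum_{s=0}^{S-1} |Y_n(\vartheta_s, \varphi_s)|^2}. \tag{3.18}$$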



Again this expression depends on the sensor locations. However, for the
eigenbeam of order zero a simplified expression can be found:

$$\mathrm{WNG}_0(\omega) = S\, |b_0(\omega)|^2. \tag{3.19}$$



From Fig. 3.4, it can be seen that the modal coefficient for the zero-order mode 
equals unity for low frequencies. Therefore we find a WNG of S for the zero- 
order mode, which is the well known result for the maximum WNG achievable 
with an array. Towards higher frequencies $b_0$ decreases and so does the WNG.
This is different from a delay-and-sum beamformer where the maximum WNG 
remains S. The reason for the decrease is that in this analogy we are only 
looking at the zero-order mode. With increasing frequency the sound energy is 
distributed over an increasing number of modes. If only the zero-order mode is 
used, this additional energy in the higher-order modes requires a proportional 
loss in energy in the zero-th order mode resulting in a decrease in the WNG. 

The second special case is a superdirectional pattern for which $b_n \ll b_{n-1}$.
Using this constraint (3.17) becomes: 



WNG;v(a;) 



6/v(tu) 


2 


En=0Cn>;.(^0,<P0) 


Civ(w)ajv 




Ea=0 ¥’s)l 



(3.20)

From (3.20) it can be seen that as long as the modal beamformer only uses
frequency independent weights $c_n$, the frequency dependence is determined by
the modal coefficients $b_N$. According to Fig. 3.4, this is a 6N dB per octave
slope. Again, this result agrees with the well known behavior of a differential
array [17].

6. BEAMPATTERN DESIGN 

Using the eigenbeams as input signals to the beamformer simplifies the beam- 
pattern design greatly. This section describes two design concepts: first the 
design of an arbitrary beam-pattern and second the design of optimum beam- 
pattern with regards to the directivity index under the constraint of a minimum 
WNG. 

6.1 ARBITRARY BEAMPATTERN DESIGN 

The main design goal is to achieve a desired beampattern $d(\vartheta, \varphi)$. Thus, one
needs to find the modal weights $c_n$. Exploiting the orthonormality property of
the spherical harmonics, these weights can easily be found according to:

$$c_{nm} = \int_{\Omega} d(\vartheta, \varphi)\, Y_n^{m*}(\vartheta, \varphi)\, d\Omega. \tag{3.21}$$

Theoretically any beam-pattern can be realized. In practice, however, d is
limited by the spherical harmonics that are available to the modal beamformer.
As discussed earlier, the highest order of a typical practical spherical array
will be around fourth-order.
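For a pattern that does not depend on $\varphi$, the projection onto the spherical harmonics reduces to a one-dimensional integral over $\cos\vartheta$. The following sketch illustrates this special case only; it assumes zero-degree harmonics of the form $\sqrt{(2n+1)/(4\pi)}\,P_n(\cos\vartheta)$, and the function name and the cardioid test pattern are hypothetical.

```python
import numpy as np
from numpy.polynomial import legendre


def modal_weights_axisymmetric(pattern, max_order, num_nodes=64):
    """Project an axisymmetric pattern d(theta) onto the zero-degree harmonics."""
    x, wts = legendre.leggauss(num_nodes)  # quadrature nodes in x = cos(theta)
    d = pattern(np.arccos(x))
    c = np.empty(max_order + 1)
    for n in range(max_order + 1):
        unit = np.zeros(n + 1)
        unit[n] = 1.0  # selects P_n in the Legendre series
        y_n0 = np.sqrt((2 * n + 1) / (4 * np.pi)) * legendre.legval(x, unit)
        # For phi-independent integrands the surface integral is 2*pi times
        # the integral over x = cos(theta).
        c[n] = 2 * np.pi * np.sum(wts * d * y_n0)
    return c


def cardioid(theta):
    """First-order cardioid, used here only as a test pattern."""
    return 0.5 + 0.5 * np.cos(theta)


c = modal_weights_axisymmetric(cardioid, max_order=3)
```

For the cardioid only the zero- and first-order weights are non-zero, which reflects the statement that the realizable pattern is limited by the available harmonics.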



6.2 OPTIMUM BEAMPATTERN DESIGN 

This subsection describes a method to compute the coefficients $c_n$ that result
in a maximum achievable directivity index (DI). A constraint on the WNG is
included in the optimization. The optimization method adapts the approach
given by Cox et al. in [18]. The directivity factor (D) of a microphone is
defined as the ratio of energy picked up by an omnidirectional microphone to
the energy picked up by a directive microphone in an isotropic noise field. Both
microphones must have the same sensitivity towards the look direction. The
DI is 10 times the base-10 logarithm of the directivity factor D. If a directive
microphone is used in a spherically isotropic noise field, the DI can be seen
as the acoustical signal-to-noise ratio (SNR) improvement achieved by the directive
microphone for signals propagating along the look direction. For an array D
can be written in matrix notation:



$$D(\omega_0) = \frac{\mathbf{w}^H \mathbf{G}_0 \mathbf{G}_0^H \mathbf{w}}{\mathbf{w}^H \mathbf{R} \mathbf{w}}, \tag{3.22}$$

where $(\cdot)^H$ denotes Hermitian transpose. On the right side of the equation the
frequency dependence is omitted for readability. The vector w contains the
sensor weights at frequency $\omega_0$:

$$\mathbf{w} = [\, w_0 \;\; w_1 \;\; \cdots \;\; w_{S-1} \,]^T, \tag{3.23}$$

where $(\cdot)^T$ denotes transpose of a vector or a matrix. The sensor weights w
can be expressed in terms of the modal weights $c_n$ as follows:

$$\mathbf{w} = \mathbf{H}\mathbf{c}. \tag{3.24}$$



This is basically (3.11) in matrix notation. The elements of H are:

i’^bn{krs,ka)' 



(3.25) 



H is an S-by-N matrix. The vector c contains the spherical harmonic
coefficients $c_n$ used for the beampattern design. $\mathbf{G}_0$ in (3.22) represents a vector
describing the source array transfer function for the look direction at $\omega_0$. For a
pressure sensor close to a rigid sphere these values can be computed from (3.4).
The spatial cross-correlation matrix is R. The matrix elements are defined by:

$$r_{pq} = \int_{\Omega} G(\vartheta_p, \varphi_p, \vartheta, \varphi, kr_p, ka)\, G(\vartheta_q, \varphi_q, \vartheta, \varphi, kr_q, ka)^{*}\, d\Omega. \tag{3.26}$$



In Section 5, the WNG was defined and can be rewritten in matrix notation as
follows:

$$\mathrm{WNG} = \frac{\mathbf{w}^H \mathbf{P} \mathbf{w}}{\mathbf{w}^H \mathbf{w}}. \tag{3.27}$$

The above equations assume that only the spherical harmonics of degree 0 are
used for the pattern. If desired, the equations can be rewritten to include other
spherical harmonics. The goal is now to maximize the D with a constraint on
the WNG. This is the same as minimizing the following function, where the
Lagrange multiplier $\epsilon$ is used to include the constraint:

$$\frac{1}{f} = \frac{1}{D} + \epsilon\, \frac{1}{\mathrm{WNG}}. \tag{3.28}$$

Following the approach in [18], one obtains the following equation that has to
be maximized with respect to the coefficient vector c:

$$f(\mathbf{c}) = \frac{\mathbf{c}^H \mathbf{H}^H \mathbf{P} \mathbf{H} \mathbf{c}}{\mathbf{c}^H \mathbf{H}^H (\mathbf{R} + \epsilon \mathbf{I}) \mathbf{H} \mathbf{c}}, \tag{3.29}$$

where I is the identity matrix. Equation (3.29) is a generalized eigenvalue
problem [19]. Since H, R, and I are of full rank, the solution is the eigenvector
corresponding to:

$$\max \left\{ \lambda \left( [\mathbf{H}^H (\mathbf{R} + \epsilon \mathbf{I}) \mathbf{H}]^{-1} (\mathbf{H}^H \mathbf{P} \mathbf{H}) \right) \right\}, \tag{3.30}$$

where $\lambda(\cdot)$ means “eigenvalue from.” Unfortunately (3.30) cannot be solved
for $\epsilon$. One way to find the maximum D for a desired WNG is as follows:

1. Find the solution to (3.30) for an arbitrary $\epsilon$.

2. From the resulting vector c compute the WNG.

3. If the WNG is larger than desired, then start again with Step 1, with a
smaller $\epsilon$; if the WNG is too small, start again with Step 1, now using a
larger $\epsilon$. If the resulting WNG matches the desired WNG, the iteration
is complete.

Figure 3.6 Maximum DI, allowing spherical harmonics up to order N, WNG is arbitrary.
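The three-step search can be sketched as a bisection on the Lagrange multiplier. This is only an illustration under stated assumptions: P is taken to be the rank-one matrix formed from the look-direction vector, the search bisects on a logarithmic scale, and the toy data (a four-sensor line array in a diffuse field, with H set to the identity) are hypothetical stand-ins for the spherical array quantities.

```python
import numpy as np


def optimum_weights(H, R, P, eps):
    """Eigenvector of [H^H (R + eps I) H]^{-1} (H^H P H) with the largest eigenvalue."""
    S = R.shape[0]
    A = H.conj().T @ P @ H
    B = H.conj().T @ (R + eps * np.eye(S)) @ H
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(B, A))
    return eigvecs[:, np.argmax(eigvals.real)]


def white_noise_gain(H, P, c):
    """WNG = w^H P w / w^H w with w = H c."""
    w = H @ c
    return float(np.real(np.vdot(w, P @ w)) / np.real(np.vdot(w, w)))


def weights_for_target_wng(H, R, P, target_wng, eps_lo=1e-8, eps_hi=1e6, iters=80):
    """Steps 1 to 3: adjust eps until the resulting WNG matches the target."""
    for _ in range(iters):
        eps = np.sqrt(eps_lo * eps_hi)          # midpoint on a log scale
        c = optimum_weights(H, R, P, eps)       # Step 1
        if white_noise_gain(H, P, c) > target_wng:  # Step 2
            eps_hi = eps                        # WNG too large: use a smaller eps
        else:
            eps_lo = eps                        # WNG too small: use a larger eps
    return c, eps


# Hypothetical toy data: 4 sensors on a line, 0.1 wavelength apart.
S = 4
x = 0.1 * np.arange(S)                          # positions in wavelengths
R = np.sinc(2.0 * (x[:, None] - x[None, :]))    # diffuse-field coherence sin(kd)/(kd)
g = np.exp(-2j * np.pi * x)                     # endfire look-direction vector
P = np.outer(g, g.conj())
H = np.eye(S)                                   # assumed: weights set directly per sensor
target_wng = 1.0                                # 0 dB
c_opt, eps_opt = weights_for_target_wng(H, R, P, target_wng)
```

The bisection relies on the WNG of the solution growing as the multiplier grows, which matches the direction of the adjustments in Step 3.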

Note that the choice of $\epsilon = 0$ results in the maximum achievable DI. On
the other hand, $\epsilon \to \infty$ results in a delay-and-sum beamformer. The latter has
the maximum achievable WNG, since all sensor signals will be summed up
in phase, yielding the maximum output signal. It can be seen in (3.28) that
f(c) depends monotonically on $\epsilon$. Figure 3.6 shows the maximum DI that can
be achieved with a 24 element array using spherical harmonics up to third-
order without a constraint on the WNG. It is well known that the theoretical
maximum is $\mathrm{DI}_{\max} = 20 \log_{10}(N + 1)$ [17]. In Fig. 3.6, it can be seen that
there are deviations from the theoretical value at higher frequencies due to
spatial aliasing. For a spherical array with 24 elements on the surface of the
sphere with a = 3.75 cm, the maximum usable frequency is about 5 kHz.
This explains the deviations from the theoretical DI starting just below 5 kHz.
According to the theoretical limit, the DI for an array using spherical harmonics
up to third-order cannot exceed 12 dB. For the given array there will be aliasing
starting around 5 kHz, which means that spherical harmonics of higher orders
will be included in the beamforming. Therefore a DI larger than 12 dB does
not violate the theoretical limit! In Fig. 3.7 one finds the WNG corresponding
to the maximum DI in Fig. 3.6. As was found in Section 5, as long as the
pattern is superdirectional, the WNG increases with 6N dB per octave. The
maximum WNG that can be achieved is about $\mathrm{WNG}_{\max} = 10 \log_{10} S$, which
is for the 24 element array about 14 dB. In Fig. 3.7, one can see that for the
sphere baffled array the maximum WNG is a bit higher, about 16 dB. Once the
maximum is reached it decreases. This is due to the fact that the mode number
in the array pattern is constant. Since the mode magnitude decreases once a
mode has reached its maximum, the WNG is expected to decrease as soon as
the highest mode has reached its maximum. For example, the first-order mode
shows this for f = 2 kHz (compare Fig. 3.4).

Figure 3.7 WNG corresponding to maximum DI from Fig. 3.6.
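The two limits quoted above are easy to reproduce. The helper names in this short sketch are hypothetical; the formulas are the ones stated in the text.

```python
import math


def max_directivity_index_db(order):
    """Theoretical maximum DI for harmonics up to the given order: 20 log10(N + 1)."""
    return 20 * math.log10(order + 1)


def max_wng_db(num_sensors):
    """Maximum WNG for in-phase summation over S sensors: 10 log10(S)."""
    return 10 * math.log10(num_sensors)


di_third_order = max_directivity_index_db(3)   # about 12 dB
wng_24_sensors = max_wng_db(24)                # about 14 dB
```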

Figure 3.8 shows the maximum DI that can be achieved with a constraint on 
the WNG for a pattern that contains the spherical harmonics up to third-order. 
Here one can see the tradeoff between WNG and DI. The higher the required 
WNG, the lower the maximum DI and vice versa. For a minimum WNG of 
-10 dB one gets a DI of 12 dB above a frequency of about 1.7 kHz. Between
100 Hz and 1.7 kHz the DI increases from 6 dB to 12 dB. 

Figures 3.9 and 3.10 give the magnitude and phase of the coefficients com- 
puted according to the procedure described above in this section. N was set to 3 
and the minimum required WNG was -10 dB. The coefficients are normalized 
so that the sensitivity for the look direction is unity.

Figure 3.8 Maximum DI with different constraints on the WNG, N = 3.

Figure 3.9 Magnitude of filter response $c_n(\omega)$ for maximum DI design with N = 3 and
WNG > -10 dB.

Figure 3.10 Phase of filter response $c_n(\omega)$ for maximum DI design with N = 3 and WNG >
-10 dB.

7. MEASUREMENTS

A 24 element spherical array, whose geometry is given in the Appendix, was
measured in an anechoic environment. Figures 3.11 to 3.14 show the measured
beampattern for the real part of the eigenbeams $Y_0$, $Y_1^1$, $Y_2^2$ and $Y_3^3$. These
eigenbeams were chosen since they are the most important for the resolution 
in the horizontal plane which typically is the preferred plane of operation. The 
beampatterns are shown in two ways: on the left side the familiar polar plot is 
shown for selected frequencies; on the right side the pattern is shown over the 
complete frequency band of operation from 50 Hz to 5 kHz. This allows a better 
visual display of the frequency dependent behavior of the directional properties. 
Since we are only interested in directional properties here, the beampatterns are 
normalized to 0 dB for the look-direction, which is 0° for the eigenbeams of 
Figs. 3.11 to 3.14. The angular resolution for the measurements was 5°. 

It can be seen that the zero and first order beams are almost ideal with only 
slight deviation from the theoretical pattern. The variations over frequency are 
very small. For the second- and third-order patterns, significant deviations exist 
at frequencies below 700 Hz and 1.7 kHz, respectively. This was expected since 
the WNG is very low in these regions. For a beampattern design this implies 
that the second-order eigenbeams should not be used below 700 Hz and the 
third-order eigenbeams not below 1.7 kHz. 

Another measurement was made to show an example output of the modal
beamformer. An application was used that allows one to vary the directivity (and
therefore DI) of the output beam continuously between zero and about 12 dB,
the maximum for a third-order system. The beampattern for the medium DI
of about 6 dB and for the maximum DI of about 12 dB was measured. It is
displayed in Figs. 3.15 and 3.16. To show the ability to steer the array, the
look-direction is set to 90°. The first pattern is between a first- and second-
order cardioid. Since the first-order pattern is dominant, the pattern is frequency
invariant over the complete operating range. The second pattern is close to a
third-order hypercardioid pattern, which by definition has the highest directivity
for any given order. It is interesting to see the transition from a first-order pattern
at low frequencies to a second-order pattern at medium frequencies and a third-
order pattern at high frequencies. This transition was designed to keep the
WNG almost constant over frequencies.

Figure 3.11 Beam pattern of eigenbeam with order zero.

Figure 3.12 Beam pattern of eigenbeam with order one and degree one.

Figure 3.13 Beam pattern of eigenbeam with order two and degree two.

Figure 3.14 Beam pattern of eigenbeam with order three and degree three.



8. SUMMARY 

This chapter describes the mathematical framework for a spherical microphone
array that is flush mounted on the surface of a rigid sphere. It was
shown that the beamforming process can be logically divided into a two-stage
beamformer. In the first eigenbeamformer stage, the sound-field is decomposed
into spatially orthonormal beams which are called eigenbeams. In the
second modal-beamformer stage, the beam-shaping and steering is done by a
simple and efficient matrix multiplication operation. This two-stage structure
results in several advantages, one of which is that the beam-shaping and steering
becomes decoupled from the microphone array geometry. The inputs to the
modal-beamformer are expressed as the standard spherical harmonics which
are well known to be orthonormal. This beamformer architecture yields a
computationally efficient implementation for the modal-beamformer. Another
advantage of the two-stage beamformer approach is the inherent scalability of
the design to any desired modal order. The modal-beamformer is now independent
of the number of sensors and the actual geometry of the array (although
the geometry of the array does have to meet certain strict requirements). Also,
depending on the desired pattern or application, the beamformer has to include
only some of the harmonics provided by the decomposer. The spherical array
is suitable for a broad range of applications such as directional sound field pick-up,
teleconferencing, multi-channel and surround audio, sound-field analysis
and synthesis, room acoustic measurement and post recording spatial sound
field editing.

Figure 3.15 Beam pattern with medium directivity index (DI ≈ 6 dB).

Figure 3.16 Beam pattern with maximum directivity index (DI ≈ 12 dB).

9. APPENDIX A



Table 3.1 Locations of the sensors of a 32 element spherical array (truncated icosahedron
scheme).

sensor No.   φ [°]   ϑ [°]      sensor No.   φ [°]   ϑ [°]
     1        180      0            17        252     37.4
     2          0     63.4          18        -72    142.6
     3         72     63.4          19        216    142.6
     4        144     63.4          20        144    142.6
     5        216     63.4          21         72    142.6
     6        -72     63.4          22          0    142.6
     7         36    116.6          23         36     79.2
     8        108    116.6          24         72    100.8
     9        180    116.6          25        108     79.2
    10        252    116.6          26        144    100.8
    11        -36    116.6          27        180     79.2
    12          0    180            28        216    100.8
    13        -36     37.4          29        252     79.2
    14         36     37.4          30        -72    100.8
    15        108     37.4          31        -36     79.2
    16        180     37.4          32          0    100.8

Table 3.2 Locations of the sensors of a 24 element spherical array (extended icosahedron
scheme).

sensor No.   φ [°]   ϑ [°]      sensor No.   φ [°]   ϑ [°]
     1          0     37.4          13         30    100.8
     2         60     37.4          14         90    100.8
     3        120     37.4          15        150    100.8
     4        180     37.4          16        210    100.8
     5        240     37.4          17        270    100.8
     6        300     37.4          18        330    100.8
     7          0     79.2          19         30    142.6
     8         60     79.2          20         90    142.6
     9        120     79.2          21        150    142.6
    10        180     79.2          22        210    142.6
    11        240     79.2          23        270    142.6
    12        300     79.2          24        330    142.6

Abstract Digital noise reduction processing is used in many telecommunications applications
to enhance the quality of speech. This investigation focuses on the class of
single-channel noise reduction methods employing the technique of short-time 
spectral modification, a class that includes the popular method of spectral subtrac- 
tion. The simplicity and relative effectiveness of these subband noise reduction 
methods has resulted in explosive growth in their use for a variety of speech 
communications applications. The most commonly used forms of the short-time 
spectral modification method are discussed, including the Wiener filter, mag- 
nitude subtraction, power subtraction, and generalized parametric subtraction. 
Because of its importance to the subjective performance of any noise reduction 
method, the subject of real-time signal- and noise-level estimation is also re- 
viewed. A low-complexity noise reduction algorithm is also presented and its 
implementation is discussed. 

Keywords: Noise Reduction, Wiener Filtering, Spectral Subtraction, Short-Time Fourier
Analysis, Subband Filter Banks, Implementation



1. INTRODUCTION 

Noise enters speech communications systems in many ways. In traditional 
wire-line telephone calls, one or both parties may be speaking within an environment
having high levels of background noise. Calls made from public telephone
booths located near roadways, transportation stations, and shopping areas serve 
as examples. Similarly, cellular, or wireless, telephones permit users to place 
calls from virtually any location, and it is common for such communications 
to be degraded by noise of varied origin. In room teleconferencing applica- 
tions, in which the acoustical characteristics of the environment are generally 
assumed to be controlled (quiet), it is not uncommon for heating, ventilation
and air-conditioning systems to contribute substantial levels of noise. Noise
originates not only from acoustical sources, however. Circuit noise, generated 
electrically within the telephone network, is still prevalent throughout the global 
telecommunications system. 

When present at small or moderate additive levels, noise degrades the sub- 
jective quality of speech communications. Listening tests broadly show that 
people grow less tolerant of, and less attentive to, listening material as the
signal-to-noise ratio (SNR) of the material decreases. This phenomenon is known as
listener fatigue. When the SNR of speech material is very low, say less than 
10 dB, the intelligibility of speech is affected. 

Even traditionally low levels of noise can present a problem, especially when 
multiple speech channels are combined as in conferencing or bridging. In 
multiparty, or multipoint, teleconferencing, the background noise present at 
the microphone(s) of each point of the conference combines additively at the 
network bridge with the noise processes from all other points. The loudspeaker 
at each location of the conference therefore reproduces the combined sum of the 
noise processes from all other locations. This problem becomes serious as the 
number of conferencing points increases. Consider a three-point conference in 
which the room noise at all locations is stationary and independent with power 
P. Each loudspeaker receives noise from the other two locations, resulting 
in a total received noise power of 2P, or 3 dB greater than that of a two- 
point conference. With N points, each side receives a total noise power that 
is 10 log10(N - 1) dB greater than P. For example, in a conference with 10
participating locations, the received noise power at each point is about 10 dB 
greater than that of the two-party case. Because a 10 dB increase in sound 
power level roughly translates to a doubling of perceived loudness, the noise 
level perceived by each participant is twice as loud as that of the two-party 
case. The benefits of noise reduction processing for cases such as this are 
clearly evident. 
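The arithmetic in this example can be reproduced with a few lines. The function name is hypothetical; the formula is the one given in the text, assuming independent room noise of equal power P at every location.

```python
import math


def conference_noise_increase_db(num_points):
    """Received noise power relative to P when noise from N - 1 other rooms adds up."""
    if num_points < 2:
        raise ValueError("a conference needs at least two points")
    return 10 * math.log10(num_points - 1)


two_point = conference_noise_increase_db(2)     # reference case
three_point = conference_noise_increase_db(3)   # about 3 dB
ten_point = conference_noise_increase_db(10)    # about 9.5 dB
```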

A variety of approaches have been proposed to reduce noise for purposes 
of speech enhancement. Included are: classic (static) Wiener filtering [6]; dy- 
namic comb filtering (see citations in [14]), in which a linear filter is adapted to 
pass only the harmonic components of voiced speech as derived from the pitch 
period; dynamic, linear all-pole and pole-zero modeling of speech (see cita- 
tions in [14]), in which the coefficients of the (noise-free) model are estimated 
from the noisy speech; short-time spectral modification techniques, in which 
the magnitude of the short-time Fourier transform is attenuated at frequencies
where speech is absent [1]-[5], [11]-[20], [23]-[25]; and hidden Markov modeling
[21, 22], a technique also employing time-varying models for speech but
where the evolution of model coefficients is governed by transition probabilities 
associated with model states.

Predominately, speech noise reduction systems are used to improve the subjective
quality of speech, lessening the degree to which listener fatigue limits
the perceived quality of speech communications. Though some work [14, 24]
has shown that intelligibility improvement is possible through digital noise- 
reduction processing, these results apply to carefully designed listening tests. 

All the above noise reduction methods share the property that they operate 
on a single channel of noisy speech. They are blind techniques, in the sense 
that only the noise-corrupt speech is known to the algorithm. Thus, in order to 
enhance the speech-signal-to-noise ratio, the algorithms must form bootstrap 
estimates of the signal and noise. When multiple channels containing either the 
same noisy speech source or noise source alone are available, a wide range of 
spatial acoustic processing technologies applies. Included are adaptive beam- 
formers and adaptive noise cancelers. Adaptive noise cancellation methods 
are coherent noise reduction processors, exploiting the phase coherency among 
multiple time series channels to cancel the noise, whereas the noise reduction 
methods listed above are incoherent processors. 

This work focuses solely on the class of single-channel noise reduction meth- 
ods employing short-time spectral modification techniques. The simplicity and 
relative effectiveness of these methods has resulted in explosive growth in their 
use for a variety of speech communications applications. Today, noise reduc- 
tion processors appear in a variety of commercial products, including cellular 
telephone handsets; cellular hands-free, in-the-car telephone adjuncts; room 
teleconferencing systems; in-network speech processors, such as bridges and 
echo cancelers; in-home telephone appliances, including speakerphones and 
cordless phones; and hearing aid and protection devices. Frequently, noise
reduction processors are used in conjunction with other audio and
speech enhancement devices. In room teleconferencing systems, for exam- 
ple, noise reduction is often combined with acoustic echo cancellation and 
microphone-array processing (beamforming). In summary, the diversity and
complexity of modern communications systems present ample opportunity to 
apply methods of digital noise reduction processing. 

The organization of this chapter is as follows. We begin in Section 2 with a 
brief review of Wiener filtering because of its fundamental relation to all past 
and modern noise reduction methods that employ spectral modification. Sec- 
tion 3 reviews the technique of short-time Fourier analysis and discusses its use 
in noise reduction. The short-time Wiener filter is discussed, and a variety of 
commonly used variations on the Wiener filter are reviewed. Techniques for 
estimating the signal and noise envelopes are reviewed in Section 4. Last, Sec- 
tion 5 presents a low-complexity implementation of a noise reduction processor 
for speech enhancement. 







2. WIENER FILTERING

Consider the problem of recovering a signal s(n) that is corrupted by additive
noise. Let

$$y(n) = s(n) + v(n) \tag{4.1}$$

represent the noisy signal and let the power spectrum of the noise source v(n) be
known or, at least, accurately estimated. When s(n) and v(n) are uncorrelated
stationary random processes, the power spectrum of the noise-corrupt signal,
$P_y(\omega)$, is simply the sum of the power spectrums of the signal and noise:

$$P_y(\omega) = P_s(\omega) + P_v(\omega). \tag{4.2}$$

Under these circumstances the power spectrum of the signal is easily recovered
by exploiting (4.2) to subtract the power spectrum of the noise from that of the
noisy observation, that is,

$$P_s(\omega) = P_y(\omega) - P_v(\omega). \tag{4.3}$$



Though trivial in concept, this fundamental spectral power subtraction relation
forms the basis for the noise reduction methods discussed throughout this
chapter.

Of course, (4.3) only provides recovery of the power spectrum of the random
process to which sample function s(n) is associated. To estimate s(n) we
rely upon classical linear estimation theory. The estimate $\hat{s}(n)$ of s(n) that
minimizes the mean-squared error $\|s(n) - \hat{s}(n)\|^2$ is given by [6]

$$\hat{S}_W(\omega) = H_W(\omega)\, Y(\omega) \tag{4.4}$$

where $\hat{S}_W(\omega)$ is the Fourier transform corresponding to the optimum $\hat{s}(n)$,
$Y(\omega)$ is the Fourier transform of y(n), and

$$H_W(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_v(\omega)} \tag{4.5}$$

is the (noncausal) Wiener filter frequency response function derived by Norbert
Wiener many years ago. Thus, the least-mean-square estimate of the signal
is acquired simply by applying a frequency dependent gain function to the
spectrum of the noisy signal. Note that by using (4.2) and (4.3) in (4.5) and
expanding (4.4) we also have

$$\hat{S}_W(\omega) = \frac{P_y(\omega) - P_v(\omega)}{P_y(\omega)}\, Y(\omega). \tag{4.6}$$

Equation (4.6) illustrates a form of the Wiener recovery method that utilizes a
spectral subtraction operation.




Subband Noise Reduction Methods 95 



To apply Wiener’s theory we must make several approximations, which in
practice do not limit the usefulness of Wiener’s result. When $P_y(\omega)$ and $P_v(\omega)$
are not known they can be estimated from the observed signals. Using the
principle of ensemble averaging for stationary signals, the power spectrums of
the signal and noise are given by the expected value of the squared modulus of
their respective Fourier transforms. With E{ } denoting expectation, we have
$P_s(\omega) = E\{|S(\omega)|^2\}$, $P_v(\omega) = E\{|V(\omega)|^2\}$ and, consequently, $P_y(\omega) = E\{|Y(\omega)|^2\}$.
Substituting expected values into (4.2) and (4.3), (4.5) gives

$$H_W(\omega) = \frac{E\{|Y(\omega)|^2\} - E\{|V(\omega)|^2\}}{E\{|Y(\omega)|^2\}}. \tag{4.7}$$

When the ensemble averages themselves are unknown we may go one step
further. The Fourier transforms of the observed signals can be used as sample
estimates of the ensemble averages, leading to

$$H_W(\omega) = \frac{|Y(\omega)|^2 - |V(\omega)|^2}{|Y(\omega)|^2} \tag{4.8}$$

as an estimate of the Wiener filter. This form of the Wiener filter serves as the
basis for the large majority of spectral-based noise reduction techniques in use
today.
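The sample-estimate form of the Wiener gain translates directly into a per-frequency computation. The sketch below is illustrative only: the function name is hypothetical, and the clipping of the gain to the range from a floor to 1 is an added assumption (the plain ratio can go negative when the noise estimate exceeds the noisy power), not something stated above.

```python
import numpy as np


def wiener_gain(noisy_power, noise_power, floor=0.0):
    """Gain (|Y|^2 - |V|^2) / |Y|^2 per frequency bin, clipped to [floor, 1]."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # Guard against division by zero in empty bins.
    safe_noisy = np.maximum(noisy_power, np.finfo(float).tiny)
    gain = (noisy_power - noise_power) / safe_noisy
    # Assumption: negative gains (noise estimate above noisy power) are clipped.
    return np.clip(gain, floor, 1.0)


# Three bins: strong signal, signal power equal to noise, noise overestimated.
gains = wiener_gain([4.0, 1.0, 0.5], [1.0, 1.0, 1.0])
```

Multiplying the noisy spectrum by these gains gives the signal estimate in the sense of (4.4).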



3. SPEECH ENHANCEMENT BY SHORT-TIME
SPECTRAL MODIFICATION

Wiener filter theory applies to stationary signals and their power spectrums. 
Speech is, of course, not stationary; its spectral content evolves with time. 
Further, although in many applications the noise source $v(n)$ is accurately modeled
as a stationary process, its power spectrum is not known exactly and must be 
estimated. Under these circumstances Wiener’s theory can be applied in a block- 
processing arrangement using short-time Fourier analysis. In the next several 
sections we review the short-time Wiener filter method of noise reduction as 
well as a few of the most commonly used variants. 



3.1 SHORT-TIME FOURIER ANALYSIS AND
SYNTHESIS

Any time series $x(n)$, stationary or otherwise, can be represented by its
short-time Fourier transform (STFT) [5, 8, 9]

$X(k,m) = \sum_{n=0}^{N-1} h(n)\,x(m-n)\,e^{-j 2\pi k n / K}, \quad k = 0, \ldots, K-1$ (4.9)

where $m$ is the time index about which the short-time spectrum is computed,
$k$ is the discrete frequency index, $h(n)$ is an analysis window, $N$ dictates
the duration over which the transform is computed, and $K$ is the number of
frequency bins at which the STFT is computed. For stationary signals the
magnitude-squared of the STFT provides a sample estimate of the power spectrum
of the underlying random process.
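
The short-time transform can be evaluated directly, one frequency bin at a
time. The sketch below assumes the conventional kernel $e^{-j 2\pi k n / K}$
applied to the windowed samples $h(n)\,x(m-n)$, and treats samples outside the
record as zero. The function name `stft_frame` and the test signal are
hypothetical.

```python
import cmath


def stft_frame(x, m, h, K):
    """Evaluate X(k, m) = sum_{n=0}^{N-1} h(n) x(m - n) exp(-j 2 pi k n / K)."""
    N = len(h)
    frame = []
    for k in range(K):
        acc = 0j
        for n in range(N):
            # Samples outside the record are taken as zero (assumption).
            sample = x[m - n] if 0 <= m - n < len(x) else 0.0
            acc += h[n] * sample * cmath.exp(-2j * cmath.pi * k * n / K)
        frame.append(acc)
    return frame


# Hypothetical example: constant input, rectangular window, N = K = 4.
x = [1.0] * 8
h = [1.0] * 4
frame = stft_frame(x, m=5, h=h, K=4)
# Magnitude-squared values serve as the sample power-spectrum estimate.
power_estimate = [abs(value) ** 2 for value in frame]
```

For the constant input used here all of the energy falls in bin $k = 0$, which
is a quick sanity check on the kernel.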

A key motivation for using the STFT in speech enhancement applications 
is that there exist synthesis formulae by which a time series can be exactly 
reconstructed from its STFT representation. As a result, if the noise can be 
eliminated from the STFT of $y(n)$, the signal estimate $\hat{s}(n)$ can be recovered
through the appropriate synthesis procedure. An STFT analysis and synthesis
structure is a type of subband filter bank, and filter bank theory specifies criteria 
under which perfect reconstruction is possible [5]. Proper synthesis of the time 
series from its STFT is key to the performance of any noise reduction method. 
Improper or ad hoc synthesis results in audible artifacts in the reconstructed 
speech time series, artifacts which reduce the quality of the very material for 
which enhancement is desired. 

In typical noise reduction applications N in (4.9) falls within the range 32- 
256 for speech time series sampled at 8 kHz. The number of frequency bins, 
or subband time series channels, K is often within this same range. Analysis 
windows $h(n)$ for subband filter banks can be designed for particular
characteristics [5]. Alternatively, any of the common data windows (e.g.,
Hamming) can be used, though such traditional windows place constraints on
filter bank structure that must be considered by the designer.

Filter bank theory is not discussed further in this chapter, though the example 
implementation in Section 5 includes a description of a subband filter bank. The 
reader is directed to [5] for a complete treatment. 



3.2 SHORT-TIME WIENER FILTER 



The STFT representation can be used to define a short-time Wiener filter. 
Replacing the Fourier transforms in (4.8) with their corresponding STFTs yields 
the short-time Wiener filter 



$H_W(k,m) = \dfrac{|Y(k,m)|^2 - |V(k,m)|^2}{|Y(k,m)|^2}.$ (4.10)



Replacing the Fourier transforms in (4.4) with their corresponding STFT rep- 
resentations and using (4.10) gives 



$\hat{S}_W(k,m) = \dfrac{|Y(k,m)|^2 - |V(k,m)|^2}{|Y(k,m)|^2}\,Y(k,m)$ (4.11)



as the estimate of the STFT of the desired signal. As discussed above, the
desired full-band speech time series estimate is recovered from
$\hat{S}_W(k,m)$ by using the appropriate synthesis procedure associated with
the subband filter bank structure being used.

The short-time Wiener filter method of noise reduction was studied in [12,
13, 14]. Though the Wiener filter was not the first of the spectral modification
methods to be investigated for noise reduction, its form is basic to nearly all
the noise reduction methods investigated over the last forty years.

In any implementation of (4.11), or any other noise reduction method
discussed herein, $Y(k,m)$ is computed at every time index $m$ while $V(k,m)$
is computed only for those $m$ for which speech is absent. Otherwise,
corruption of the noise envelope estimate occurs. When speech is present,
$V(k,m)$ must be estimated from past samples. This subject is discussed further
in Section 4.

3.3 POWER SUBTRACTION 

An alternative estimate of $S(k,m)$ is arrived at by departing from Wiener’s
theory. $S(\omega)$ can be represented in terms of its magnitude and phase
components, namely,

$S(\omega) = |S(\omega)|\,e^{j\phi_s(\omega)}.$ (4.12)

Thus, $S(\omega)$ can be estimated if estimates of its magnitude and phase can
be found. Consider, first, stationary $s(n)$ and $v(n)$. As above, if the sample
functions $|S(\omega)|^2$ and $|V(\omega)|^2$ are used in place of the power
spectrums $P_s(\omega)$ and $P_v(\omega)$, (4.3) becomes

$|\hat{S}(\omega)|^2 = |Y(\omega)|^2 - |V(\omega)|^2.$ (4.13)

The square root of (4.13) therefore provides an estimate of the signal’s
magnitude spectrum. Concerning the signal’s phase, if the signal-to-noise ratio
of the noisy signal is reasonably high, the phase of the noisy signal,
$\phi_y(\omega)$, can be used in place of $\phi_s(\omega)$. Using these magnitude
and phase estimates in (4.12) yields

$\hat{S}_{PS}(\omega) = \sqrt{|Y(\omega)|^2 - |V(\omega)|^2}\; e^{j\phi_y(\omega)}$ (4.14)

as the spectrum estimate of the desired signal. Using STFT quantities in place
of the power spectrums in (4.14) yields

$\hat{S}_{PS}(k,m) = \sqrt{|Y(k,m)|^2 - |V(k,m)|^2}\; e^{j\phi_y(k,m)}$ (4.15)

as the short-time spectrum estimate. The form in (4.14) and (4.15) is referred 
to as the power subtraction method of noise reduction. Power subtraction was 
studied in [4, 12, 13, 14, 17]. 
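
A minimal sketch of the power subtraction estimate for a single STFT sample
follows: the magnitude is the square root of the power difference and the phase
is borrowed from the noisy observation. The function name `power_subtraction`
and the numerical values are hypothetical, and clamping a negative difference
to zero is an assumption of this sketch (the handling of negative differences
is discussed in Section 3.6.3).

```python
import cmath
import math


def power_subtraction(noisy_sample, noise_power):
    """Power subtraction for one complex STFT sample Y(k, m)."""
    difference = abs(noisy_sample) ** 2 - noise_power
    # Assumption: a negative power difference is clamped to zero.
    magnitude = math.sqrt(max(difference, 0.0))
    # Reuse the phase of the noisy observation.
    return magnitude * cmath.exp(1j * cmath.phase(noisy_sample))


# Hypothetical sample with |Y| = 5 and noise power |V|^2 = 9.
estimate = power_subtraction(3.0 + 4.0j, 9.0)      # magnitude 4, same phase
suppressed = power_subtraction(1.0 + 0.0j, 4.0)    # difference is negative
```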

Note that (4.14) provides a consistent estimate of the magnitude-squared of
$S(\omega)$. Assuming the signal and noise are uncorrelated, we have

$E\{|\hat{S}_{PS}(\omega)|^2\} = E\{|Y(\omega)|^2\} - E\{|V(\omega)|^2\} = P_y(\omega) - P_v(\omega) = P_s(\omega).$ (4.16)

Like the Wiener filter, the power subtraction method also can be shown
to be optimal from within the estimation theoretic framework, albeit from a
different formulation of the problem. Let $S(k,m)$ and $V(k,m)$ be realizations
of independent, stationary Gaussian random processes. Also, without loss
of generality, assume $S(k,m)$ and $V(k,m)$ are real. If $\sigma_s^2$ and
$\sigma_v^2$ are the variances of the signal and noise STFTs, the probability
density function of the observation $Y(k,m)$ given the known signal and noise
variances is given by [7, 12]

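$p\bigl(Y(k,m) \mid \sigma_s^2, \sigma_v^2\bigr) = \dfrac{1}{\pi(\sigma_s^2 + \sigma_v^2)} \exp\left\{-\dfrac{Y^2(k,m)}{\sigma_s^2 + \sigma_v^2}\right\}.$ (4.17)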

The maximum likelihood estimate of the signal variance is the estimate
that maximizes the likelihood of the noisy observation occurring. Maximizing
(4.17) with respect to the signal variance gives

$\hat{\sigma}_s^2 = Y^2(k,m) - \sigma_v^2$ (4.18)

as the best estimate of the signal variance. This estimate suggests the power
subtraction estimator (4.15) when variances are replaced by sample
approximations, that is, by the corresponding magnitude-squared STFT
quantities. Thus, the power subtraction estimator results from the optimum
maximum likelihood signal-variance estimator. In comparison, the Wiener
estimator results from the optimum minimum mean-squared error estimate of the
signal spectrum, or equivalently, the signal time series.

McAulay and Malpass [13] show that the time domain dual of (4.17), in
which $y(n)$ replaces $Y(k,m)$ and the variances are of the signal and noise
time series, also yields the power subtraction form when maximized with respect
to the unknown signal variance.
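
The maximization leading to (4.18) can be written out in a few lines. The
sketch below assumes that (4.17) is a zero-mean Gaussian density in $Y(k,m)$
whose variance is the sum $\sigma_s^2 + \sigma_v^2$ and whose normalizing
factor is $\pi(\sigma_s^2 + \sigma_v^2)$; the location of the maximum does not
depend on the constant in that factor.

```latex
\begin{align*}
\ln p\bigl(Y(k,m) \mid \sigma_s^2, \sigma_v^2\bigr)
  &= -\ln\bigl[\pi(\sigma_s^2 + \sigma_v^2)\bigr]
     - \frac{Y^2(k,m)}{\sigma_s^2 + \sigma_v^2}, \\
\frac{\partial \ln p}{\partial \sigma_s^2}
  &= -\frac{1}{\sigma_s^2 + \sigma_v^2}
     + \frac{Y^2(k,m)}{(\sigma_s^2 + \sigma_v^2)^2} = 0, \\
\hat{\sigma}_s^2 &= Y^2(k,m) - \sigma_v^2 .
\end{align*}
```

Setting the derivative to zero forces the total variance to equal the observed
power $Y^2(k,m)$, and subtracting the known noise variance leaves the signal
variance estimate.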

3.4 MAGNITUDE SUBTRACTION 

Yet another estimate of $S(k,m)$ is suggested by the magnitude-only form of
(4.3). Consider

$|\hat{S}(k,m)| = |Y(k,m)| - |V(k,m)|$ (4.19)

as an estimate of the magnitude of $S(k,m)$. Using (4.19) in (4.12), and
appending the phase of the noisy signal as done in Section 3.3, gives

$\hat{S}_{MS}(k,m) = \bigl[\,|Y(k,m)| - |V(k,m)|\,\bigr]\, e^{j\phi_y(k,m)}$ (4.20)

as the short-time spectrum estimate of the signal. The form (4.20) is referred to
as the magnitude subtraction method of noise reduction. This method of noise
reduction was popularized by Boll [11], but was suggested by Weiss et al. [15]
and even earlier by Schroeder [1, 2, 4].

3.5 PARAMETRIC WIENER FILTERING



The Wiener, power subtraction and magnitude subtraction schemes are
closely related. Several authors have exploited this fact to define a general
class of noise reduction methods. To see this, first note that the power
subtraction estimate (4.15) can be rewritten as

$\hat{S}_{PS}(k,m) = \sqrt{|Y(k,m)|^2 - |V(k,m)|^2}\; e^{j\phi_y(k,m)} = H_{PS}(k,m)\,Y(k,m),$ (4.21)

where

$H_{PS}(k,m) = \left[\dfrac{|Y(k,m)|^2 - |V(k,m)|^2}{|Y(k,m)|^2}\right]^{1/2}.$ (4.22)

Similarly, the magnitude subtraction estimate (4.20) can be rewritten as

$\hat{S}_{MS}(k,m) = \bigl[\,|Y(k,m)| - |V(k,m)|\,\bigr]\, e^{j\phi_y(k,m)} = H_{MS}(k,m)\,Y(k,m),$ (4.23)

where

$H_{MS}(k,m) = \dfrac{|Y(k,m)| - |V(k,m)|}{|Y(k,m)|}.$ (4.24)

Both forms result from the generalized estimate

$\hat{S}_G(k,m) = H_G(k,m)\,Y(k,m),$ (4.25)

where

$H_G(k,m) = \left[1 - \left(\dfrac{|V(k,m)|}{|Y(k,m)|}\right)^{\gamma}\right]^{\beta}.$ (4.26)

This has been referred to as parametric Wiener filtering [14] or parametric
spectral subtraction [17]. Power subtraction results from using
$(\gamma, \beta) = (2, 1/2)$, magnitude subtraction from
$(\gamma, \beta) = (1, 1)$, and the Wiener estimate from
$(\gamma, \beta) = (2, 1)$.

For general $(\gamma, \beta)$ the parametric form (4.25) has not been
demonstrated to satisfy optimality criteria, though this fact does not in any
way diminish the usefulness of the generalized form for choosing noise
reduction gain formulae.
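
The three special cases can be checked with a few lines of Python. The sketch
assumes the generalized gain takes the form
$[1 - (|V|/|Y|)^{\gamma}]^{\beta}$, which reproduces the Wiener, power
subtraction and magnitude subtraction gains for the parameter pairs quoted
above. The function name `parametric_gain`, the sample magnitudes, and the
clamping of a negative bracket to zero are assumptions of this sketch.

```python
def parametric_gain(noisy_magnitude, noise_magnitude, gamma, beta):
    """Generalized gain [1 - (|V|/|Y|)**gamma]**beta for one STFT sample."""
    bracket = 1.0 - (noise_magnitude / noisy_magnitude) ** gamma
    # Assumption: a negative bracket is clamped to zero before the power.
    return max(bracket, 0.0) ** beta


# Hypothetical magnitudes |Y| = 2 and |V| = 1.
wiener = parametric_gain(2.0, 1.0, gamma=2, beta=1)           # 1 - 1/4
power_sub = parametric_gain(2.0, 1.0, gamma=2, beta=0.5)      # sqrt(1 - 1/4)
magnitude_sub = parametric_gain(2.0, 1.0, gamma=1, beta=1)    # 1 - 1/2
```

For the same input the ordering is power subtraction (least attenuation),
Wiener, then magnitude subtraction (most attenuation).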

Figure 4.1 Gain functions for the Wiener (upper-most solid), spectral power
subtraction (dash), spectral magnitude subtraction (dot), and a posteriori SNR
voice activity detection (lower-most solid) methods of noise reduction as a
function of a priori input signal-to-noise ratio.

Figure 4.1 shows a plot of the Wiener (4.10), power subtraction (4.22) and
magnitude subtraction (4.24) noise reduction gain functions as a function of
a priori SNR. The a priori SNR is the ratio $\sigma_s^2/\sigma_v^2$. Also, for
purposes of evaluating each gain function, the magnitude-squared sample
spectrums have been replaced by their respective variances. The curves show the
attenuation of each gain function as input SNR decreases. Note that the Wiener
and power subtraction gain functions provide an attenuation of no greater than
6 dB at an input SNR of 0 dB. That is, in any subband $k$ in which the signal
and noise powers are equal, the contribution of that subband to the
reconstructed speech time series is no less than half its input amplitude. A
fourth gain curve appearing in the figure, namely, that of a
voice-activity-detection-based method, will be discussed in Section 5.
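
The 0 dB figures quoted above can be reproduced numerically. The sketch below
assumes, as the text describes, that each magnitude-squared spectrum is
replaced by its variance, so that $|V|^2/|Y|^2$ becomes
$\sigma_v^2/(\sigma_s^2 + \sigma_v^2) = 1/(1 + \xi)$ with $\xi$ the a priori
SNR. The function name `gains_db` and the symbol $\xi$ are introduced here for
illustration only.

```python
import math


def gains_db(a_priori_snr_db):
    """Wiener, power subtraction and magnitude subtraction gains in dB."""
    xi = 10.0 ** (a_priori_snr_db / 10.0)   # a priori SNR as a power ratio
    noise_fraction = 1.0 / (1.0 + xi)       # sigma_v^2 / (sigma_s^2 + sigma_v^2)
    wiener = 1.0 - noise_fraction
    power_sub = math.sqrt(1.0 - noise_fraction)
    magnitude_sub = 1.0 - math.sqrt(noise_fraction)
    return tuple(20.0 * math.log10(g) for g in (wiener, power_sub, magnitude_sub))


# At 0 dB a priori SNR: about -6.0 dB, -3.0 dB and -10.7 dB respectively.
wiener_db, power_db, magnitude_db = gains_db(0.0)
```

Under these assumptions the Wiener gain is exactly one half at 0 dB, while
magnitude subtraction attenuates noticeably more than the other two methods.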

Gain curves such as Fig. 4.1 provide some insight into the nature of a given 
noise reduction algorithm, but provide little indication of the subjective speech 
quality of the resulting noise-reduced signal. More important to the subjective 
quality of a noise reduction algorithm is the manner in which the speech and 
noise envelopes are estimated; this subject is discussed in Section 4. 

3.6 REVIEW AND DISCUSSION 

3.6.1 Schroeder’s Noise Reduction Device. The first use of spectral gain
modification methods in speech noise reduction is described in a little-known
U.S. patent issued in 1965 to M. R. Schroeder [1]-[4], who at the time was
working for AT&T Bell Laboratories. A block diagram of Schroeder’s noise
reduction system is shown in Fig. 4.2. This diagram is modified from its original
form for this discussion, and incorporates elements presented by Schroeder in
a related 1968 patent [2] and also by Schroeder’s colleagues in a subsequent
published work [4].

Figure 4.2 Schroeder’s noise reduction system. After M. R. Schroeder [1, 2].

Schroeder’s system was a purely analog implementation of spectral magnitude
subtraction. As shown in the figure, a bank of bandpass filters separates
the noisy signal into $K$ different frequency bands. The bandwidth of each
filter is about 300 Hz. Ten individual filters therefore cover the 300 to
3300 Hz range necessary for telephony grade speech applications. The
noise-reduction processing performed in each band is identical. First, the
output of each bandpass filter is rectified and averaged using a lowpass filter
to produce a short-time estimate of the noisy speech envelope for the band. The
lowpass filter has a cutoff of between 0 and 10 Hz. An estimate of the
noise-only envelope is then subtracted from the noisy speech envelope. To
estimate the noise, the noise level estimator uses a series of resistors,
capacitors and diodes to produce a running estimate of the minima of the noisy
speech envelope. The decay time of this noise estimator is instantaneous while
the rise time is very large, on the order of seconds. Between speech utterances
the noisy speech envelope contains only noise, and the noise level estimator
quickly decays to meet the level of the noise. During utterances the noise
level estimate changes very little. Thus, the output of the subtraction block
is an estimate of the noise-free signal envelope for the band, or
$|\hat{S}(k,m)|$ in the current notation. A second rectification is performed
on the output to accommodate negative results from the difference node
(negative estimates are simply set to zero). Finally, the noise-free signal
envelope is used as a multiplier with the unmodified output of the bandpass
filter for the band, and the result is summed with the results from all bands to
form the reconstructed full-band time series $\hat{s}(n)$.

It is interesting to note that Schroeder’s implementation was a purely analog
one, employing bandpass filters and rectification and averaging circuitry. Other
aspects of the design preceded presentations by later authors. By rectifying the
output of the signal envelope estimator, Schroeder’s system could correct for
negative estimates.

3.6.2 Literature Review. Following Schroeder’s work, it was not until
the mid-1970s that interest in noise reduction systems grew, presumably because
of the availability of digital computers and analog processors that could be
controlled by digital decision logic. With the unification of digital multirate
processing theory in the late 1970s and 1980s came the realization that
Wiener-filter processing, among other operations, could be accomplished
efficiently using digital subband architectures [5, 8, 9].

Digital noise reduction processing for speech enhancement was popularized
by the technique of spectral subtraction. This renewed interest appears to have
been sparked by a 1974 paper by Weiss, Aschkenasy, and Parsons [15]. Their
paper describes a “spectrum shaping” method that used amplitude clipping, or
gating, in filter banks to remove low-level excitation, presumably noise. A
few years later, Boll [11], in an often-cited reference, was apparently the first
to reintroduce the spectral subtraction method that Schroeder had identified
nearly 20 years earlier. Boll was perhaps the first to cast the magnitude
subtraction method in the framework of digital short-time Fourier analysis,
which had earlier been under development by, among others, Allen [9] and
Portnoff [8]. Shortly after, McAulay and Malpass [13] presented one of the first
treatments of the spectral subtraction method from within a framework of
optimal estimation, which included Wiener filter theory. They described a class
of spectral subtraction estimators, including power subtraction and magnitude
subtraction, from within an estimation theoretic framework [7]. Coincident with
[13], Lim and Oppenheim [14] presented one of the first comprehensive
treatments of methods of speech enhancement and noise reduction. The spectral
subtraction methods are discussed, also within a framework of optimal
estimation, and are compared to other methods of speech enhancement.

Lim and Oppenheim recognize Weiss, Aschkenasy, and Parsons as originators
of the spectral subtraction technique, as Schroeder’s work was likely
unknown to them at the time. Also, in 1980, Sondhi, Schmidt, and Rabiner
[4] published results from a series of implementation studies that grew from
Schroeder’s work in the 1960s. This work was the first published reference
to Schroeder’s work, other than the patents [1, 2], but likely was not widely
known itself because it was published in the Bell System Technical Journal.

Yet another spectral noise reduction method has been proposed by Ephraim
and Malah [12]. They derive a related spectral noise reduction method based on
optimum short-time Fourier amplitude estimation. Differing from a variance
estimator, that is, power subtraction, the amplitude estimator is optimum in
the sense of providing the best minimum mean-squared error estimate of the
spectral amplitude. The Fourier amplitude estimator converges to the Wiener
estimator at high input signal-to-noise ratios.

Recently, noise reduction processing incorporating psychoacoustic percep- 
tual models has been proposed. Tsoukalas and Mourjopoulos [24] present a 
spectral gain modification technique that uses perceptual models to suppress 
only those components of the noise that are above audibility thresholds. These 
thresholds are dynamic, changing with the changing spectral character of the 
speech itself. A reported 40% gain in intelligibility can be achieved when 
precise information about the noise power level is known. 

3.6.3 Musical Noise. Much of the work in noise reduction in the last 20
years has been directed toward implementation issues associated with spectral
subtraction methods, particularly in understanding and eliminating a host of
processing artifacts commonly referred to as musical noise. Musical noise is a
processing artifact that has plagued all spectral modification methods. This
artifact is perceived by many as the sound made by an ensemble of low-amplitude
tonal components, the frequencies of which are changing rapidly over time. The
amplitude of these components is usually small, on the order of the noise power
itself. The artifact has two principal causes. First, because the STFTs in
(4.26) are computed over short intervals of time, and because the noise
envelope estimate is made only during periods of silence while the noisy signal
envelope estimate is always made, the difference in (4.26) can actually be
negative. In such cases the common approach is to set the difference to zero or
to invert the sign of the difference and use the result. Such harsh action
creates sudden discontinuities in the trajectories of spectral amplitudes, and
this induces the artifact. Second, artifacts are induced by improper synthesis
of the full-band time series. Ad hoc “FFT processing” results in filter banks
that do not possess the quality of perfect reconstruction and, moreover, cause
aliasing of the subband time series in both frequency and time.

The works of a variety of authors have shown that artifacts such as musical
noise can be nearly eliminated by taking proper action in chiefly three aspects
of the processing. First, careful design of the filter bank is required. Second,
the proper use of time-averaging techniques, in conjunction with appropriate
decision criteria, is necessary to produce stable estimates and obviate the need
for rectification following spectral subtraction. Third, the complementary
technique of augmenting the traditional gain function with a soft-decision,
voice-activity-detection (VAD) statistic has proven extremely successful. This
latter technique supplants the traditional gain-based noise reduction form in
(4.25) with

$\hat{S}_G(k,m) = H_G(k,m)\,P\bigl(H_1 \mid Y(k,m)\bigr)\,Y(k,m),$ (4.27)

where $H_1$ denotes the hypothesis that the signal is present in the observation
and $P(H_1 \mid Y(k,m))$ is the probability that the signal is present
conditioned on the observation. $P(H_1 \mid Y(k,m))$ acts as a gain function
itself. At very low input SNRs, $P(H_1 \mid Y(k,m))$ further suppresses
fluctuations in the traditional gain function, fluctuations caused by the
statistical volatility of the signal and noise envelope estimates.

Boll [11] was one of the first to augment a traditional gain function
(magnitude subtraction) with a VAD. Boll’s method integrates the estimated
a priori SNR across all frequency bins and uses the resulting scalar in a binary
VAD which, if satisfied, applies a fixed additional attenuation to the
right-hand side of (4.25). McAulay and Malpass [13] derived a “soft-decision”
VAD from within the detection theoretic framework [7]. Unlike a binary VAD, a
soft VAD varies continuously within (0,1) as a function of $Y(k,m)$. Later,
Ephraim and Malah [12] incorporated a measure of the “signal presence
uncertainty” into their Fourier amplitude estimator. Clearly, $H_G(\cdot)$ and
$P(\cdot)$ in (4.27) can be combined into a single gain function; this idea is
pursued in the implementation presented in Section 5.

3.6.4 A Word About Phase Augmentation. It is often noted that the
practice of phase augmentation - that is, using the phase of the noisy signal as
an estimate of the signal’s phase - is acceptable because the ear is relatively
insensitive to phase corruption. More is known, however. Within the context of
noise reduction, Vary [20] has shown that as long as the SNR is at least 6 dB in
any subband $k$ for which the gain function is near unity, the resulting
distortion is generally imperceptible. In other words, because the noisy phase
contributes to $\hat{S}(k,m)$ only at those $k$ for which the input SNR is
positive, the phase of $\hat{S}(k,m)$ itself is essentially noise-free.
Additionally, Ephraim and Malah [12] show that the noisy phase is a good choice
because it has the property of not corrupting the envelope of the optimum
short-time Fourier amplitude estimator that forms the basis of their method.

4. AVERAGING TECHNIQUES FOR ENVELOPE
ESTIMATION

Proper estimation of both the noisy signal envelope and noise envelope is 
paramount to the performance of any noise reduction technique. Improper es- 
timation of either envelope will result in an unacceptably high level of audible 
processing artifacts, such as musical noise. Up to this point, instantaneous en- 
velope quantities have been used in the gain formulae of the noise reduction 
methods discussed. To combat musical noise, however, averaged, or smoothed, 
envelopes are used. In the next few sections a variety of commonly used time 
averaging techniques are reviewed and discussed in the context of noise
reduction.

4.1 MOVING AVERAGE

One of the simplest ways to improve the stability of the noise estimate is to
replace it with an arithmetic average computed over its recent past. For each
time index $m$, $|V(k,m)|$ is replaced by a smoothed version $\bar{V}(k,m)$
given by

$\bar{V}(k,m) = \dfrac{1}{M} \sum_{l=0}^{M-1} |V(k,m-l)|,$ (4.28)

where $M$ is the number of samples used in the average. The over-bar notation
in (4.28) shall be used throughout to denote any averaged magnitude quantity,
regardless of the averaging technique actually used. In the computation of (4.28)
the reader should note that $V(k,m)$ is simply $Y(k,m)$ under the assumption
that speech is absent.
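
A minimal sketch of the moving average for a single subband follows. The
function name `moving_average_envelope` and the sample values are hypothetical,
and averaging over fewer than $M$ samples during start-up is an assumption of
this sketch rather than something the text specifies.

```python
def moving_average_envelope(magnitudes, M):
    """Arithmetic average of the M most recent magnitudes at each time index.

    `magnitudes` holds |V(k, m)| for one subband k, oldest sample first.
    """
    smoothed = []
    for m in range(len(magnitudes)):
        # Assumption: until M samples are available, average what exists.
        window = magnitudes[max(0, m - M + 1): m + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical noise magnitudes for one subband, averaged over M = 2 samples.
noise_envelope = moving_average_envelope([1.0, 3.0, 2.0, 6.0], M=2)
```

The slice keeps at most $M$ samples, which is the $M$-length history the
arithmetic average has to store for each subband.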

Because of statistical variation in the noisy signal, it is also advantageous
to use a smoothed version, $\bar{Y}(k,m)$, in place of $|Y(k,m)|$ in the gain
formulae. Though the amount of smoothing necessary depends upon several aspects
of the implementation, such as the subband filter bank structure,
$\bar{Y}(k,m)$ is generally smoothed much less than $\bar{V}(k,m)$. This is
because variations in $\bar{Y}(k,m)$ and $\bar{V}(k,m)$ induce different
artifacts in the signal estimate $\hat{S}_G(k,m)$. Referring to (4.26), positive
fluctuations in $|V(k,m)|$ can cause the difference in (4.26) to be negative,
requiring rectification. Consequently, it is beneficial to average the noise
magnitude over long intervals (assuming stationary noise). Similarly, positive
fluctuations in $|Y(k,m)|$ resulting from the statistical variability of the
noise component reduce the effectiveness of noise reduction because $H_G(k,m)$
is larger for larger $|Y(k,m)|$. Thus, some smoothing of the noisy speech
envelope is also beneficial. Excessive smoothing of the noisy speech envelope,
however, degrades the speech quality of the signal estimate because $S(k,m)$
is not stationary. Excessive smoothing of $|Y(k,m)|$, and therefore
$H_G(k,m)$, disperses $H_G(k,m)$ to the point that it is no longer well matched
to the speech component of the noisy observation $Y(k,m)$.

Early on, Boll [11] described the use of arithmetic averaging to reduce the
presence of artifacts. For the STFT filter bank implementation used, Boll
applied the same sized average to both the noise and noisy speech envelopes
(about 38 ms). McAulay and Malpass [13] also discuss arithmetic averaging.

4.2 SINGLE-POLE RECURSION 

The arithmetic average requires an $M$-length history of the data. Further,
each sample in the average receives the same weight, although it is trivial
to include a tapered weighting window in (4.28) if desired. An alternative
to arithmetic averaging is recursive averaging. Using a single-pole recursive
average the noise envelope estimate becomes

$\bar{V}(k,m) = \alpha\,\bar{V}(k,m-1) + (1-\alpha)\,|V(k,m)|,$ (4.29)

where $\alpha$, $0 < \alpha < 1$, is the coefficient of smoothing. Equation
(4.29) defines a first-order lowpass filter and so the variance of
$\bar{V}(k,m)$ is less than the variance of $|V(k,m)|$ itself.

Recursive averaging has been by far the most popular method of averaging
used in the spectral noise reduction methods. This is due to its simplicity and
efficiency, requiring only a single memory location for state variable storage.
Also, because the impulse response corresponding to (4.29) decays as
$\alpha^n$, $n \geq 0$, the recursive average weights the recent past more
heavily than the distant past. This characteristic has been found beneficial to
noise reduction processing. Indeed, the first investigations into the use of
averaging techniques employed recursive averaging. Schroeder’s method
(Fig. 4.2) incorporates an analog version of (4.29) in the signal path common
to both the noise and noisy signal envelope estimators. Sondhi et al. [4]
experimented with variations on the recursive average for both the power
subtraction and magnitude subtraction methods. For computing $\bar{Y}(k,m)$,
cutoff frequencies of between 10 and 30 Hz, or greater, were found to be
effective [4]. For estimating $\bar{V}(k,m)$, cutoff frequencies of between 1
and 10 Hz were sufficient. Other proponents of the recursive average include
McAulay and Malpass [13], Ephraim and Malah [12], and Cappe [16].
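
The single-pole recursion is a one-line update per sample, as the sketch below
shows. The function name `recursive_average`, the zero initial state, and the
choice of smoothing coefficient 0.5 are assumptions made so that the impulse
response can be read off exactly.

```python
def recursive_average(magnitudes, alpha, initial=0.0):
    """Single-pole recursion: avg = alpha * avg + (1 - alpha) * |V(k, m)|."""
    average = initial          # the only state that must be stored
    smoothed = []
    for magnitude in magnitudes:
        average = alpha * average + (1.0 - alpha) * magnitude
        smoothed.append(average)
    return smoothed


# Impulse response for alpha = 0.5: (1 - alpha) * alpha**n.
impulse_response = recursive_average([1.0, 0.0, 0.0, 0.0], alpha=0.5)
```

Each output is $(1-\alpha)\,\alpha^n$, so older samples receive geometrically
smaller weight, in contrast to the equal weights of the arithmetic average.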

4.3 TWO-SIDED SINGLE-POLE RECURSION 

An alternative to the classic single-pole recursive filter involves choosing
$\alpha$ in (4.29) based upon the magnitude of $|V(k,m)|$ relative to
$\bar{V}(k,m-1)$. Consider the so-called two-sided single-pole recursion in
which $\alpha$ in (4.29) is given by

$\alpha = \begin{cases} \alpha_a, & \text{if } |V(k,m)| \geq \bar{V}(k,m-1) \\ \alpha_d, & \text{if } |V(k,m)| < \bar{V}(k,m-1) \end{cases}$ (4.30)

where $\alpha_a$ is the “attack” coefficient and $\alpha_d$ is the “decay”
coefficient. The two-sided recursive average employs two different filter
response times, depending on whether the input is increasing or decreasing in
magnitude relative to the current average. This property can be advantageous.
Consider, first, the computation of $\bar{V}(k,m)$. Although it is desirable to
update $\bar{V}(k,m)$ only when speech is absent, it is not always possible to
determine when speech is present and when it is not. If speech or other
transient phenomena are present in $Y(k,m)$ and (4.29) is updated,
$\bar{V}(k,m)$ will become corrupt. This problem can be reduced by choosing
$\alpha_a > \alpha_d$. In this case increases in the input magnitude change
$\bar{V}(k,m)$ much less than decreases do, and therefore $\bar{V}(k,m)$ is less
perturbed by transient phenomena that are not components of the stationary
noise.

The two-sided single-pole recursion can also be used to update
$\bar{Y}(k,m)$. For this purpose it is common to choose $\alpha_a < \alpha_d$,
in which case $\bar{Y}(k,m)$ is more responsive to the sudden onset of speech
energy than to the end of an utterance when speech energy decays. This
characteristic improves the response of $H_G(k,m)$ in (4.26) to the onset of
speech.

Etter and Moschytz [17] and Diethorn [19] used the two-sided single-pole
recursion in the context of noise reduction; it is also used in the implementation
discussed in Section 5. The technique has its origins in speakerphone technology,
where it is used for voice activity detection; for example, see [26].
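
The two-sided recursion differs from the single-pole form only in how the
coefficient is selected at each step. The sketch below assumes the attack
coefficient applies whenever the input magnitude is at or above the current
average and the decay coefficient otherwise. The function name
`two_sided_average` and the coefficient values are hypothetical, chosen to be
exactly representable so the arithmetic can be followed by hand.

```python
def two_sided_average(magnitudes, alpha_attack, alpha_decay, initial=0.0):
    """Single-pole recursion whose coefficient depends on input direction."""
    average = initial
    smoothed = []
    for magnitude in magnitudes:
        # Attack when the input is at or above the running average.
        alpha = alpha_attack if magnitude >= average else alpha_decay
        average = alpha * average + (1.0 - alpha) * magnitude
        smoothed.append(average)
    return smoothed


# Noise-tracking configuration: slow attack, fast decay (alpha_a > alpha_d).
# A two-sample burst of level 10 rides on a floor of level 1.
noise_track = two_sided_average(
    [10.0, 10.0, 1.0, 1.0], alpha_attack=0.875, alpha_decay=0.5, initial=1.0
)
```

With these values the estimate creeps up to about 3.1 during the burst and
falls back toward the floor more quickly once the burst ends. Swapping the two
coefficients gives the fast-attack, slow-decay behavior suited to the noisy
speech envelope.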

As a variation on (4.29)-(4.30), Etter and Moschytz [17] also proposed using

$\bar{V}(k,m) = \begin{cases} \alpha_a \bar{V}(k,m-1), & \text{if } \alpha_a \bar{V}(k,m-1) < |V(k,m)| \\ \alpha_d \bar{V}(k,m-1), & \text{if } \alpha_d \bar{V}(k,m-1) > |V(k,m)| \\ |V(k,m)|, & \text{otherwise} \end{cases}$ (4.31)

where $\alpha_a > 1$ and $0 < \alpha_d < 1$. In subjective listening tests, this
so-called two-slope limitation filter reportedly performs better than
(4.29)-(4.30) for some material [17].
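
One reading of the two-slope limitation filter is that the envelope follows
the input exactly unless the input moves outside a band set by the previous
envelope, in which case the envelope moves only as far as the band edge. The
sketch below implements that reading; the function name `two_slope_limit`, the
coefficient values, and the sample data are hypothetical.

```python
def two_slope_limit(magnitudes, alpha_attack, alpha_decay, initial):
    """Limit the per-step growth and decay of an envelope estimate.

    Requires alpha_attack > 1 and 0 < alpha_decay < 1.
    """
    envelope = initial
    limited = []
    for magnitude in magnitudes:
        upper = alpha_attack * envelope   # largest allowed new value
        lower = alpha_decay * envelope    # smallest allowed new value
        if upper < magnitude:
            envelope = upper
        elif lower > magnitude:
            envelope = lower
        else:
            envelope = magnitude
        limited.append(envelope)
    return limited


# Hypothetical input: a jump up, a value inside the band, then a drop.
limited = two_slope_limit([4.0, 1.3, 0.1], alpha_attack=1.25,
                          alpha_decay=0.5, initial=1.0)
```

The first sample is limited on the way up, the second is passed unchanged, and
the third is limited on the way down.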

4.4 NONLINEAR DATA PROCESSING 

To improve the stability of the noise estimate further, Sondhi et al. [4]
also experimented with a scheme to post-process the noise envelope
$\bar{V}(k,m)$ based upon a short-term histogram of its past values. This early
rank ordering technique provided a means to prune wild points from the noise
envelope estimate. Median filtering and other rank-order statistical filtering
can be used to post-process $\bar{V}(k,m)$ and $\bar{Y}(k,m)$ following any of
the averaging techniques described above; see [10] for an early reference on
such methods. More recently, Plante et al. [25] have described a noise
reduction method using reassignment methods to replace envelope estimates that
are deemed erroneous. In general, nonlinear data processing techniques can
provide improved noise reduction performance, although the behavior of such
methods is sometimes difficult to analyze.
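
As an illustration of rank-order post-processing, the sketch below applies a
running median over the most recent envelope values. It is a generic median
filter, not the histogram scheme of [4]; the function name `median_smooth`,
the window length of three, and the sample data are assumptions.

```python
from statistics import median


def median_smooth(envelope, length=3):
    """Running median over the `length` most recent envelope values."""
    smoothed = []
    for m in range(len(envelope)):
        # Use only current and past values (shorter window at start-up).
        window = envelope[max(0, m - length + 1): m + 1]
        smoothed.append(median(window))
    return smoothed


# Hypothetical noise envelope with one wild point (9.0).
pruned = median_smooth([2.0, 2.0, 9.0, 3.0, 2.0], length=3)
```

The isolated value 9.0 never reaches the output, whereas a linear average of
the same length would spread it over three samples.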

5. EXAMPLE IMPLEMENTATION 

Noise reduction systems need not be complicated to produce acceptable 
results, as the implementation described in this section demonstrates. The 
method described is a variation on that first presented in [19]. 

Figure 4.3 shows a signal-flow diagram of the noise reduction system. The
algorithm consists of four key processes: subband analysis, envelope
estimation, gain computation, and subband synthesis. Each of these components
is described below.

Figure 4.3 Noise reduction system based on a posteriori SNR voice activity detection.

5.1 SUBBAND FILTER BANK ARCHITECTURE

The subband architecture implements a perfect reconstruction filter bank 
using the uniform discrete Fourier transform (DFT) filter bank method [5]. 
This filter bank is one of a sub-class of so-called polyphase filter banks, but is 
somewhat simpler in computational structure. 

The subband filter bank implements an overlap-add process. At the start of
each processing epoch, a block of $L$ new time-series samples is shifted into an
$N$-sample shift register. Here, $L = 16$ and $N = 64$. The shift-register data
are multiplied by a length-$N$ analysis window (the filter bank’s prototype FIR
filter) and transformed via an $N$-point DFT. Each frequency bin output from the
DFT represents one new complex time-series sample for the subband frequency
range corresponding to that bin. The subband sampling rate is equal to the
full-band sampling rate divided by $L$. The bandwidth of each subband is the
ratio of the full-band sampling rate to $N$. Thus, for 8 kHz sampling, the
subband sampling rate and bandwidth are, respectively, 500 Hz and 125 Hz
(indicative of an oversampled-by-4 filter bank architecture). Following subband
analysis, the vector of subband time series is presented to the envelope
estimators. Next, the noise reduction gain is computed. To reconstruct the
noise-reduced full-band time series, the subband synthesizer first transforms
the gain-modified vector of subband time series using an inverse DFT. The
synthesis window (same as analysis window) is applied, and the result is
overlapped with, and added to, the contents of an $N$-sample output
accumulator. Last, a block of $L$ processed samples is produced at the output
of the synthesizer.

The prototype analysis and synthesis window, input-output block size L and 
transform block size N are chosen to maintain these properties of the filter 
bank: 

■ no time-domain aliasing at the subband level, 

■ no frequency domain aliasing at the subband level, and 

■ perfect reconstruction (unity transfer function) when analysis is followed 
directly by synthesis (no intermediate processing). 
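
The overlap-add procedure and the perfect reconstruction property can be illustrated with a minimal sketch. The text does not specify the prototype window, so the scaled periodic Hann window below is an assumption chosen because its hop-shifted squares sum to one for N = 64 and L = 16; the function names and the direct O(N^2) DFT are likewise illustrative only.

```python
import cmath
import math

N = 64  # window length and DFT size
L = 16  # input-output block size (hop)

def make_window():
    """Periodic Hann window (an assumed prototype), scaled so the
    hop-shifted squared windows sum to one."""
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / N) for n in range(N)]
    overlap_sum = sum(w[n] ** 2 for n in range(0, N, L))
    return [v / math.sqrt(overlap_sum) for v in w]

def dft(x, inverse=False):
    """Direct O(N^2) DFT, adequate for a small demonstration."""
    size = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(size):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / size)
                  for t in range(size))
        out.append(acc / size if inverse else acc)
    return out

def process(signal, gains=None):
    """Analysis, optional per-bin gain, synthesis.
    len(signal) is assumed to be a multiple of L."""
    w = make_window()
    shift_reg = [0.0] * N
    accum = [0.0] * N
    out = []
    for start in range(0, len(signal), L):
        # Shift L new samples into the N-sample register.
        shift_reg = shift_reg[L:] + list(signal[start:start + L])
        bins = dft([s * v for s, v in zip(shift_reg, w)])
        if gains is not None:
            bins = [g * b for g, b in zip(gains, bins)]
        frame = dft(bins, inverse=True)
        # Apply the synthesis window and overlap-add into the accumulator.
        accum = [a + f.real * v for a, f, v in zip(accum, frame, w)]
        out.extend(accum[:L])            # L finished output samples
        accum = accum[L:] + [0.0] * L
    return out
```

With no gain applied, the output of this sketch equals the input delayed by N - L = 48 samples, which is the unity transfer function property listed above.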

5.2 A-POSTERIORI-SNR VOICE ACTIVITY 
DETECTOR 

The gain function of the noise reduction algorithm is based on the idea of a 
composite gain function and soft-acting voice activity detector as discussed in 
Section 3.6.3. 

5.2.1 Envelope Computations. As shown in Fig. 4.3, the time series 
output from each subband k of the analysis filter bank is used to update 
estimates of the noisy speech and noise-only envelopes, respectively, 
$\bar{Y}(k,m)$ and $\bar{V}(k,m)$. These estimates are generated using the 
two-sided single-pole recursion described in Section 4.3. Specifically, 

$$\bar{Y}(k,m) = \beta\,\bar{Y}(k,m-1) + (1-\beta)\,|Y(k,m)| \tag{4.32}$$

and 

$$\bar{V}(k,m) = \alpha\,\bar{V}(k,m-1) + (1-\alpha)\,|Y(k,m)|, \tag{4.33}$$

where $\beta$ takes on attack and decay time constants of about 1 ms and 10 ms, 
respectively, and $\alpha$ takes on attack and decay time constants of about 
4 sec. and 1 ms. Note that, in comparison with (4.29), (4.33) uses $|Y(k,m)|$ 
in place of $|V(k,m)|$. This substitution is possible because the long attack 
time (4 sec.) of the noise envelope estimate is used in place of the logic that 
would otherwise be needed to discern the speech/no-speech condition. This 
approach further simplifies the implementation. 
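
As an illustration of (4.32) and (4.33), the following sketch tracks both envelopes at the 500 Hz subband rate. Two points are assumptions rather than statements from the text: the mapping from a time constant to a recursion coefficient (a plain exponential), and the rule that the attack coefficient applies when the input magnitude exceeds the current estimate and the decay coefficient otherwise. All function names are hypothetical.

```python
import math

def coeff(time_constant_s, rate_hz):
    """Single-pole coefficient for a time constant (assumed exponential mapping)."""
    return math.exp(-1.0 / (time_constant_s * rate_hz))

def two_sided(prev, magnitude, attack, decay):
    """One step of the recursion: attack coefficient when rising, decay when falling."""
    c = attack if magnitude > prev else decay
    return c * prev + (1.0 - c) * magnitude

def track_envelopes(magnitudes, rate_hz=500.0, start=1.0):
    """Return (noisy-speech envelope, noise envelope) pairs for one subband."""
    beta_att, beta_dec = coeff(0.001, rate_hz), coeff(0.010, rate_hz)
    alpha_att, alpha_dec = coeff(4.0, rate_hz), coeff(0.001, rate_hz)
    y_env = v_env = start
    history = []
    for mag in magnitudes:
        y_env = two_sided(y_env, mag, beta_att, beta_dec)    # (4.32)
        v_env = two_sided(v_env, mag, alpha_att, alpha_dec)  # (4.33)
        history.append((y_env, v_env))
    return history
```

For a 0.2 s burst that raises the magnitude from 1 to 10, the noisy-speech envelope follows the burst almost immediately, while the noise envelope rises by less than half a unit. That slow rise is what lets the long attack time stand in for explicit speech/no-speech logic.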

5.2.2 Gain Computation. The envelope estimates are used to compute 
a gain function that incorporates a type of voice activity likelihood function. 
This function consists of the a posteriori SNR normalized by the threshold of 
speech activity detection, $\gamma$. Specifically, the noise reduction formula is 

$$\hat{S}(k,m) = H(k,m)\,Y(k,m), \tag{4.34}$$

where gain function $H(k,m)$ is given by 

$$H(k,m) = \min\left[1, \left(\frac{\bar{Y}(k,m)}{\gamma\,\bar{V}(k,m)}\right)^{p}\right]. \tag{4.35}$$

Threshold $\gamma$ specifies the a posteriori SNR level at which the certainty of speech 
is declared and p, a positive integer, is the gain expansion factor. Typical values 
for the detection threshold fall in the range $5 < \gamma < 20$, though the 
(subjectively) best value depends on the characteristics of the filter bank architecture 
and the time constants used to compute the envelope estimates, among other 
things. The expansion factor p controls the rate of decay of the gain function 
for a posteriori SNRs below unity. With p = 1, for example, the gain decays 
linearly with a posteriori SNR. Factor p also governs the amount of noise 
reduction possible by controlling the lower bound of (4.35); larger p results in a 
smaller lower bound. The min(·) operator ensures the gain reaches a value no 
greater than unity. 

Looking at (4.35), subband time series whose a posteriori SNR exceeds the 
speech detection threshold are passed to the synthesis bank with unity gain. 
Subband time series whose a posteriori SNR is less than the threshold are 
passed to the synthesis bank with a gain that is proportional to the SNR raised 
to the power p. 

Note in particular that (4.35) does not involve a spectral subtraction operation. 
This has the benefit of circumventing the problem of a negative argument, as 
occurs with the parametric form in (4.26). A disadvantage of (4.35) is that 
the gain function, and therefore noise reduction level, is bounded below by the 
reciprocal of the detection threshold. That is, as the a priori SNR goes to zero 
we have (for p = 1) 

$$\frac{|Y(k,m)|}{\gamma\,|V(k,m)|} = \frac{|S(k,m) + V(k,m)|}{\gamma\,|V(k,m)|} \rightarrow \frac{|V(k,m)|}{\gamma\,|V(k,m)|} = \frac{1}{\gamma}. \tag{4.36}$$

For example, with $\gamma = 10$ the system provides no more than 20 dB of noise 
reduction. 
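
The gain rule (4.35) and the lower bound (4.36) can be checked numerically with a short sketch. The function names are hypothetical, and the noise-only case is modelled simply by setting the two envelopes equal.

```python
import math

def nr_gain(y_env, v_env, gamma=8.0, p=1):
    """Gain (4.35): normalized a posteriori SNR raised to p, capped at unity."""
    return min(1.0, (y_env / (gamma * v_env)) ** p)

def gain_floor_db(gamma, p=1):
    """Gain in dB when the noisy-speech and noise envelopes are equal."""
    return 20.0 * math.log10(nr_gain(1.0, 1.0, gamma, p))
```

Under these assumptions `gain_floor_db(10)` evaluates to -20 dB, matching the example above, and `gain_floor_db(8)` to about -18 dB. Raising p to 2 doubles the attenuation in dB, which is the sense in which a larger p lowers the bound.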

A variation on the above technique incorporates for each subband k both the 
per-band, or narrowband, normalized a posteriori SNR and a k-wise arithmetic 
average of the a posteriori SNRs from neighboring bands. This 
narrowband-broadband hybrid gain function can provide improved noise reduction 
performance for wideband speech utterances, such as fricatives. The reader is referred 
to [19] for more information. 

5.3 EXAMPLE 

The time series and spectrogram data in Figs. 4.4, 4.5, and 4.6 show the 
results of processing a noisy speech sample using the subband noise reduction 
method presented above. For this example $\gamma = 8$ and expansion factor p = 1, 
resulting in a minimum gain in (4.35) of -18 dB. The lower-most solid line 
in Fig. 4.1 shows the gain function (4.35) in comparison with the Wiener, 
magnitude subtraction and power subtraction gain functions. 

The upper trace in Fig. 4.4 shows a segment of raw (unprocessed) time se- 
ries for a series of short utterances (digit counting) recorded in an automobile 
traveling at highway speeds. The speech was recorded from the microphone 
channel of a wireless-phone handset and later digitized at an 8 kHz sampling 
rate. The lower trace in Fig. 4.4 shows the corresponding noise-reduced time 
series produced by the noise reduction algorithm. Figure 4.5 shows 
spectrograms corresponding to the time series in Fig. 4.4. The spectrograms show, at 
least visually, that the noise reduction method introduces no noticeable 
distortion. Figure 4.6 shows the averaged power-spectral density of the background 
noise for the raw and noise-reduced time series. These power spectral densities 
were computed from the time series in Fig. 4.4 over the interval from 10 s to 12 s. As 
can be seen, the noise floor of the processed time series is about 18 dB below 
that of the raw time series uniformly across the speech band. 

6. CONCLUSION 

The subject of noise reduction for speech enhancement is a mature one with a 
40-year history in the field of telecommunication. The majority of research has 
focused on the class of noise reduction methods incorporating the technique 
of short-time spectral modification. These methods are based upon subband 
filter bank processing architectures, are relatively simple to implement and can 
provide significant gains to the subjective quality of noisy speech. The earliest 
of these methods was developed in 1960 by researchers at Bell Laboratories. 

Noise reduction processing has its roots in classical Wiener filter theory. 
Reviewed in this chapter were the most commonly used noise reduction formu- 
lations, including the short-time Wiener filter, spectral magnitude subtraction, 
spectral power subtraction, and the generalized parametric Wiener filter. When 
implemented digitally, these methods frequently suffer from the presence of 
processing artifacts, a phenomenon known as musical noise. The origins of 
musical noise were reviewed, as were approaches to combating the problem. 
The subject of speech envelope estimation was presented in detail and sev- 
eral averaging techniques for computing envelope estimates were reviewed. A 
low-complexity noise reduction algorithm was presented and demonstrated by 
example. 

Figure 4.4 Speech time series for the noise reduction example. Original (top) and noise-reduced 
(bottom) time series. 

Notes 

1. The estimated instantaneous a priori SNR is the ratio $|S(k,m)|^2 / |V(k,m)|^2$. 

Abstract The first thing that comes to mind when we talk about acoustic echo cancellation 
is adaptive filtering. In this chapter, we discuss a large number of multichannel 
adaptive algorithms, both in time and frequency domains. This discussion will 
be developed in the context of multichannel acoustic echo cancellation where 
we have to identify a multiple-input multiple-output (MIMO) system (e.g., room 
acoustic impulse responses). 

1. INTRODUCTION 

All of today's teleconferencing systems are hands-free and single-channel 
(meaning that there is only one microphone and one loudspeaker). In the near 
future, we expect that multichannel systems (with at least two loudspeakers and 
at least one microphone) will be available to customers, therefore providing a 
realistic presence that single-channel systems cannot offer. 

In hands-free systems, the coupling between loudspeakers and microphones 
can be very strong and this can generate important echoes that eventually make 
the system completely unstable (e.g., the system starts howling). Therefore, 
multichannel acoustic echo cancelers (MCAECs) are absolutely necessary for 
full-duplex communication [1]. Let P and Q be respectively the numbers of 
loudspeakers and microphones. For a teleconferencing system, the MCAECs 
consist of PQ adaptive filters aiming at identifying PQ echo paths from P 
loudspeakers to Q microphones. This scheme is, in fact, a multiple-input 
multiple-output (MIMO) system. We assume that the teleconferencing system 
is organized between two rooms: the “transmission” and “receiving” rooms. 
The transmission room is sometimes referred to as the far-end and the receiving 
room as the near-end. So each room needs an MCAEC for each microphone. 
Thus, multichannel acoustic echo cancellation consists of a direct identification 
of an unknown linear MIMO system. 

Although conceptually very similar, multichannel acoustic echo cancellation 
(MCAEC) is fundamentally different from traditional mono echo cancellation in 
one respect: a straightforward generalization of the mono echo canceler would 
not only have to track changing echo paths in the receiving room, but also in 
the transmission room! For example, the canceler would have to reconverge if 
one talker stops talking and another starts talking at a different location in the 
transmission room. There is no adaptive algorithm that can track such a change 
sufficiently fast and this scheme therefore results in poor echo suppression. 
Thus, a generalization of the mono AEC in the multichannel case does not 
result in satisfactory performance. 

The theory explaining the problem of MCAEC was described in [1] and 
[2]. The fundamental problem is that the multiple channels may carry linearly 
related signals which in turn may make the normal equations to be solved by 
the adaptive algorithm singular. This implies that there is no unique solution to 
the equations but an infinite number of solutions, and it can be shown that all 
but the true one depend on the impulse responses of the transmission room. As 
a result, intensive studies have been made of how to handle this properly. It was 
shown in [2] that the only solution to the nonuniqueness problem is to reduce 
the coherence between the different loudspeaker signals, and an efficient low 
complexity method for this purpose was also given. 




Adaptive Algorithms for MIMO Acoustic Echo Cancellation 121 

Lately, attention has been focused on the investigation of other methods that 
decrease the cross-correlation between the channels in order to get well behaved 
estimates of the echo paths [3], [4], [5], [6], [7], [8]. The main problem is how 
to reduce the coherence sufficiently without affecting the stereo perception and 
the sound quality. 

The performance of the MCAEC is more severely affected by the choice of 
the adaptive algorithm than the monophonic counterpart [9], [10]. This is easily 
recognized since the performance of most adaptive algorithms depends on the 
condition number of the input signal covariance matrix. In the multichannel 
case, the condition number is very high; as a result, algorithms such as the least- 
mean-square (LMS) or the normalized LMS (NLMS), which do not take into 
account the cross-correlation among all the input signals, converge very slowly 
to the true solution. It is therefore highly interesting to study multichannel 
adaptive filtering algorithms. 

In this chapter, we develop a general framework for multichannel adaptive 
filters with the purpose to improve their performance in time and frequency 
domains. We also investigate a recently proposed class of adaptive algorithms 
that exploit sparsity of room acoustic impulse responses. These algorithms 
are very interesting both from theoretical and practical standpoints since they 
converge and track much better than the NLMS algorithm for example. 

2. NORMAL EQUATIONS AND IDENTIFICATION 
OF A MIMO SYSTEM 

We first derive the normal equations of a multiple-input multiple-output 
(MIMO) system. 



2.1 NORMAL EQUATIONS 

We assume that we have a MIMO system with P inputs (loudspeakers) and 
Q outputs (microphones). We also assume that the MIMO system (a room in 
our context) is linear and time-invariant. Acoustic echo cancellation consists 
of identifying P echo paths at each microphone so that in total, PQ echo paths 
need to be estimated. We have Q output (microphone) signals (see Fig. 5.1): 

$$y_q(n) = \sum_{p=1}^{P} \mathbf{h}_{pq}^T \mathbf{x}_p(n) + b_q(n), \quad q = 1, 2, \ldots, Q, \tag{5.1}$$

where superscript $T$ denotes transpose of a vector or a matrix, 

$$\mathbf{h}_{pq} = \left[\, h_{pq,0} \;\; h_{pq,1} \;\; \cdots \;\; h_{pq,L-1} \,\right]^T$$

is the echo path - of length L - between loudspeaker p and microphone q, 

$$\mathbf{x}_p(n) = \left[\, x_p(n) \;\; x_p(n-1) \;\; \cdots \;\; x_p(n-L+1) \,\right]^T, \quad p = 1, 2, \ldots, P,$$

is the pth reference (loudspeaker) signal (also called the far-end speech), and 
$b_q(n)$ is the near-end noise added at microphone q, assumed to be uncorrelated 
with the far-end speech. We define the error signal at time n for microphone q 
as 

Figure 5.1 A MIMO system consisting of P inputs and Q outputs. 

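$$e_q(n) = y_q(n) - \hat{y}_q(n) = y_q(n) - \sum_{p=1}^{P} \hat{\mathbf{h}}_{pq}^T \mathbf{x}_p(n), \tag{5.2}$$

where the $\hat{\mathbf{h}}_{pq} = [\, \hat{h}_{pq,0} \;\; \hat{h}_{pq,1} \;\; \cdots \;\; \hat{h}_{pq,L-1} \,]^T$ 
are the model filters. It is more convenient to define an error signal vector for 
all the microphones: 

$$\mathbf{e}(n) = \mathbf{y}(n) - \hat{\mathbf{y}}(n) = \mathbf{y}(n) - \hat{\mathbf{H}}^T \mathbf{x}(n), \tag{5.3}$$

where $\mathbf{y}(n) = [\, y_1(n) \;\; \cdots \;\; y_Q(n) \,]^T$, 
$\mathbf{x}(n) = [\, \mathbf{x}_1^T(n) \;\; \cdots \;\; \mathbf{x}_P^T(n) \,]^T$, and the qth column of 
$\hat{\mathbf{H}}$ is $\hat{\mathbf{h}}_q = [\, \hat{\mathbf{h}}_{1q}^T \;\; \cdots \;\; \hat{\mathbf{h}}_{Pq}^T \,]^T$. 

Having written the error signal, we now define the recursive least-squares 
error criterion with respect to the modelling filters: 

$$J(n) = \sum_{i=0}^{n} \lambda^{n-i}\, \mathbf{e}^T(i)\mathbf{e}(i) = \sum_{q=1}^{Q} \sum_{i=0}^{n} \lambda^{n-i} e_q^2(i) = \sum_{q=1}^{Q} J_q(n), \tag{5.4}$$

where $\lambda$ ($0 < \lambda \le 1$) is a forgetting factor. The minimization of (5.4) leads to 
the multichannel normal equations: 

$$\mathbf{R}_{xx}(n)\,\hat{\mathbf{H}}(n) = \mathbf{R}_{xy}(n), \tag{5.5}$$

where 

$$\mathbf{R}_{xx}(n) = \sum_{i=0}^{n} \lambda^{n-i}\, \mathbf{x}(i)\mathbf{x}^T(i) = \begin{bmatrix} \mathbf{R}_{11}(n) & \mathbf{R}_{12}(n) & \cdots & \mathbf{R}_{1P}(n) \\ \mathbf{R}_{21}(n) & \mathbf{R}_{22}(n) & \cdots & \mathbf{R}_{2P}(n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{R}_{P1}(n) & \mathbf{R}_{P2}(n) & \cdots & \mathbf{R}_{PP}(n) \end{bmatrix} \tag{5.6}$$

is an estimate of the input signal covariance matrix - of size (PL x PL), and 

$$\mathbf{R}_{xy}(n) = \sum_{i=0}^{n} \lambda^{n-i}\, \mathbf{x}(i)\mathbf{y}^T(i) \tag{5.7}$$

is an estimate of the cross-correlation matrix - of size (PL x Q) - between 
$\mathbf{x}(n)$ and $\mathbf{y}(n)$. 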

It can easily be seen that the multichannel normal equations (5.5) can be 
decomposed in Q independent normal equations, each one corresponding to a 
microphone signal: 

$$\mathbf{R}_{xx}(n)\,\hat{\mathbf{h}}_q(n) = \mathbf{r}_{xy,q}(n), \quad q = 1, 2, \ldots, Q, \tag{5.8}$$

where $\hat{\mathbf{h}}_q(n)$ [resp. $\mathbf{r}_{xy,q}(n)$] is the qth column of matrix $\hat{\mathbf{H}}(n)$ [resp. $\mathbf{R}_{xy}(n)$]. 
This result implies that minimizing $J(n)$ or minimizing each $J_q(n)$ 
independently gives the same results. This makes sense from an identification point of 
view, since the identification of the impulse responses for one microphone is 
completely independent of the others. 

2.2 THE NONUNIQUENESS PROBLEM 

In many situations, the signals $x_p(n)$ are generated from a unique source 
$s(n)$, so that: 

$$x_p(n) = \mathbf{g}_p^T \mathbf{s}(n), \quad p = 1, 2, \ldots, P, \tag{5.9}$$

where 

$$\mathbf{g}_p = \left[\, g_{p,0} \;\; g_{p,1} \;\; \cdots \;\; g_{p,L-1} \,\right]^T$$

is the impulse response between the source and microphone p in the transmission 
room in the case of a teleconferencing system [2]. Therefore the signals $x_p(n)$ 
are linearly related and we have the following [P(P - 1)/2] relations [2]: 

$$\mathbf{x}_p^T(n)\,\mathbf{g}_i = \mathbf{x}_i^T(n)\,\mathbf{g}_p, \quad i, p = 1, 2, \ldots, P, \; i \neq p. \tag{5.10}$$

Indeed, since $x_p = s * g_p$, therefore $x_p * g_i = s * g_p * g_i = x_i * g_p$ (the symbol 
$*$ is the linear convolution operator). Now, consider the following vector: 

$$\mathbf{u} = \left[\, \sum_{p=2}^{P} \zeta_p \mathbf{g}_p^T \;\; -\zeta_2 \mathbf{g}_1^T \;\; \cdots \;\; -\zeta_P \mathbf{g}_1^T \,\right]^T,$$

where $\zeta_p$ are arbitrary factors. We can verify using (5.10) that 
$\mathbf{R}_{xx}(n)\mathbf{u} = \mathbf{0}_{PL \times 1}$, so $\mathbf{R}_{xx}(n)$ is not invertible. Vector $\mathbf{u}$ 
represents the nullspace of matrix $\mathbf{R}_{xx}(n)$. 
The dimension of this nullspace depends on the number of inputs and 
is equal to (P - 2)L + 1 (for $P \ge 2$). So the problem becomes worse as 
P increases. Thus, there is no unique solution to the problem and an 
adaptive algorithm will drift to any one of many possible solutions, which can be 
very different from the "true" desired solution $\hat{\mathbf{h}}_{pq} = \mathbf{h}_{pq}$. These nonunique 
"solutions" are dependent on the impulse responses in the transmission room: 

$$\hat{\mathbf{h}}_{1q} = \mathbf{h}_{1q} + \beta \sum_{p=2}^{P} \zeta_p \mathbf{g}_p, \tag{5.11}$$

$$\hat{\mathbf{h}}_{pq} = \mathbf{h}_{pq} - \beta \zeta_p \mathbf{g}_1, \quad p = 2, \ldots, P, \tag{5.12}$$

where $\beta$ is an arbitrary factor. This, of course, is intolerable because $\mathbf{g}_p$ can 
change instantaneously, for example, as one person stops talking and another 
starts [1], [2]. 
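
The linear relation behind (5.10) is easy to confirm numerically. The sketch below uses arbitrary integer test data (the source samples and the two transmission-room responses are made up for illustration) and checks that $x_1 * g_2 = x_2 * g_1$ when both loudspeaker signals derive from one source.

```python
def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Hypothetical source signal and transmission-room impulse responses.
s = [3, -1, 4, 1, -5, 9, 2, -6]
g1 = [1, 2, -1]
g2 = [2, 0, 1]

x1 = convolve(s, g1)   # signal picked up by microphone 1
x2 = convolve(s, g2)   # signal picked up by microphone 2

lhs = convolve(x1, g2)
rhs = convolve(x2, g1)
```

Because the two sides agree sample by sample, any filter pair proportional to (g2, -g1) can be added to the echo path estimates without changing the modelled echo, which is the nonuniqueness described above.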



2.3 THE IMPULSE RESPONSE TAIL EFFECT 

We first define an important measure that is very useful for MCAEC. 
Definition: The quantity 

$$\frac{\|\mathbf{h}_q - \hat{\mathbf{h}}_q\|}{\|\mathbf{h}_q\|}, \quad q = 1, 2, \ldots, Q, \tag{5.13}$$

where $\|\cdot\|$ denotes the two-norm of a vector, is called the normalized misalignment 
and measures the mismatch between the impulse responses of the receiving 
room and the modelling filters. In the multichannel case, it is possible to have 
good echo cancellation even when the misalignment is large. However, in such 
a case, the cancellation will degrade if the gp change. A main objective of 
MCAEC research is to avoid this problem. 
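
A minimal sketch of the misalignment measure (5.13) follows. Reporting the ratio in decibels is a common convention assumed here, not something stated in the text, and the function name is hypothetical.

```python
import math

def normalized_misalignment_db(h_true, h_est):
    """Normalized misalignment (5.13) expressed in dB (the dB scale is an assumption)."""
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_true, h_est)))
    ref = math.sqrt(sum(a * a for a in h_true))
    if err == 0.0:
        return float("-inf")
    return 20.0 * math.log10(err / ref)
```

For example, an estimate [0.9, 0, 0] of a true response [1, 0, 0] gives a misalignment of about -20 dB.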

Actually, for the practical case when the length of the adaptive filters is 
smaller than the length of the impulse responses in the transmission room, there 
is a unique solution to the normal equation, although the covariance matrix is 
very ill-conditioned. 

On the other hand, we can easily show by using the classical normal equa- 
tions that if the length of the adaptive filters is smaller than the length of the 
impulse responses in the receiving room, we introduce an important bias in the 
coefficients of these filters because of the strong cross-correlation between the 
input signals and the large condition number of the covariance matrix [2]. So 
in practice, we may have poor misalignment even if there is a unique solution 
to the normal equations. 

The only way to decrease the misalignment is to partially decorrelate two- 
by-two the P input (loudspeaker) signals. Next, we summarize a number of 
approaches that have been developed recently for reducing the cross-correlation. 



2.4 SOME DIFFERENT SOLUTIONS FOR 
DECORRELATION 



If we have P different channels, we need to decorrelate them partially and 
mutually. In the following, we show how to partially decorrelate two channels. 
The same process should be applied for all the channels. It is well-known 
that the coherence magnitude between two processes is equal to 1 if and only 
if they are linearly related. In order to weaken this relation, some non-linear 
or time-varying transformation of the stereo channels has to be made. Such 
a transformation reduces the coherence and hence the condition number of 
the covariance matrix, thereby improving the misalignment. However, the 
transformation has to be performed cautiously so that it is inaudible and has no 
effect on stereo perception. 

A simple nonlinear method that gives good performance uses a half-wave 
rectifier [2], so that the nonlinearly transformed signal becomes 

$$x'_p(n) = x_p(n) + \alpha\, \frac{x_p(n) + |x_p(n)|}{2}, \tag{5.14}$$

where $\alpha$ is a parameter used to control the amount of nonlinearity. For this 
method, there can only be a linear relation between the nonlinearly transformed 
channels if, $\forall n$, $x_1(n) \ge 0$ and $x_2(n) \ge 0$, or if we have 
$a\,x_1(n - \tau_1) = x_2(n - \tau_2)$ with $a > 0$. In practice however, these cases never occur 
because we always have zero-mean signals and $\mathbf{g}_1$, $\mathbf{g}_2$ are in practice never 
related by just a simple delay. 

An improved version of this technique is to use positive and negative 
half-wave rectifiers on each channel respectively: 

$$x'_1(n) = x_1(n) + \alpha\, \frac{x_1(n) + |x_1(n)|}{2}, \tag{5.15}$$

$$x'_2(n) = x_2(n) + \alpha\, \frac{x_2(n) - |x_2(n)|}{2}. \tag{5.16}$$

This principle removes the linear relation even in the special signal cases given 
above. 
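
The two rectifier transformations (5.15) and (5.16) amount to a few lines of code. The following sketch is illustrative; the function name and the `positive` flag are hypothetical.

```python
def half_wave_nonlinearity(x, alpha=0.5, positive=True):
    """Add a scaled half-wave rectified copy of the signal to itself.

    positive=True follows (5.15); positive=False follows (5.16).
    """
    out = []
    for v in x:
        rect = (v + abs(v)) / 2.0 if positive else (v - abs(v)) / 2.0
        out.append(v + alpha * rect)
    return out
```

With alpha = 0.5, the positive rectifier scales positive samples by 1.5 and leaves negative samples unchanged, while the negative rectifier does the opposite, so the two channels are distorted on complementary half-waves.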

Experiments show that stereo perception is not affected by the above 
methods even with $\alpha$ as large as 0.5. Also, the distortion introduced for speech is 
hardly audible because of the nature of the speech signal and psychoacoustic 
masking effects [11]. This is explained by the following three reasons. First, the 
distorted signal $x'_p(n)$ depends only on the instantaneous value of the original 
signal $x_p(n)$ so that during periods of silence, no distortion is added. Second, 
the periodicity remains unchanged. Third, for voiced sounds, the harmonic 
structure of the signal induces “self-masking” of the harmonic distortion com- 
ponents. This kind of distortion is also acceptable for some music signals but 
may be objectionable for pure tones. 

Other types of nonlinearities for decorrelating speech signals have also been 
investigated and compared [12]. The results indicate that, of the several non- 
linearities considered, ideal half-wave rectification and smoothed half-wave 
rectification appear to be the best choices for speech. For music, the nonlinear- 
ity parameter of the ideal rectifier must be readjusted. The smoothed rectifier 
does not require this readjustment but is a little more complicated to implement. 

In [6] a similar approach with non-linearities is proposed. The idea is ex- 
panded so that four adaptive filters operate on different non-linearly processed 
signals to estimate the echo paths. These non-linearities are chosen such that the 
input signals of two of the adaptive filters are independent, which thus represent 
a "perfect" decorrelation. Tap estimates are then copied to a fixed two-channel 
filter which performs the echo cancellation with the unprocessed signals. The 
advantage of this method is that the NLMS algorithm could be used instead of 
more sophisticated algorithms. 

Another approach that makes it possible to use the NLMS algorithm is to 
decorrelate the channels by means of complementary comb filtering [1], [13]. 
The technique is based on removing the energy in a certain frequency band of 
the speech signal in one channel. This means the coherence would become zero 
in this band and thereby results in fast alignment of the estimate even when using 
the NLMS algorithm. Energy is removed complementarily between the 
channels so that the stereo perception is not severely affected for frequencies above 
1 kHz. However, this method must be combined with some other decorrelation 
technique for lower frequencies [14]. 

Two methods based on introducing time-varying filters in the transmission 
path were presented in [7], [8]. In [7], left and right signals are filtered 
through two independent time-varying first-order all-pass filters. Stochastic 
time-variation is introduced by making the pole position of the filter a random 
walk process. The actual position is limited by the constraints of stability and 
inaudibility of the introduced distortion. While significant reduction in cor- 
relation can be achieved for higher frequencies with the imposed constraints, 
the lower frequencies are still fairly unaffected by the time-variation. In [8], a 
periodically varying filter is applied to one channel so that the signal is either 
delayed by one sample or passed through without delay. A transition zone 
between the delayed and non-delayed state is also employed in order to reduce 
audible discontinuities. This method may also affect the stereo perception. 

Although the idea of adding independent perceptually shaped noise to the 
channels was mentioned in [1], [2], thorough investigations of the actual benefit 
of the technique were not presented. Results regarding variants of this idea can 
be found in [4], [5]. A pre-processing unit estimating the masking threshold 
and adding an appropriate amount of noise was proposed in [5]. It was also 
noted that adding a masked noise to each channel may affect the spatialization 
of the sound even if the noise is inaudible at each channel separately. This 
effect can be controlled through correction of the masking threshold when 
appropriate. In [4], the improvement in misalignment was studied for stereo 
acoustic echo cancellation (SAEC) when a perceptual audio coder was added in 
the transmission path. Reduced 
correlation between the channels was shown by means of coherence analysis, 
and improved convergence rate of the adaptive algorithm was observed. A 
low-complexity method for achieving additional decorrelation by modifying 
the decoder was also proposed. The encoder cannot quantize every single 
frequency band optimally due to rate constraints. This has the effect that there 
is a margin on the masking threshold which can be exploited. In the presented 
method, the masking threshold is estimated from the modified discrete cosine 
transform (MDCT) coefficients delivered by the encoder, and an appropriate 
inaudible amount of decorrelating noise is added to the signals. 

In the rest of this chapter, we suppose that one of the previous decorrelation 
methods is used so the normal equations have a unique solution. However, the 
input signals can still be highly correlated, therefore requiring special treatment. 

3. THE CLASSICAL AND FACTORIZED 
MULTICHANNEL RLS 

From the normal equations (5.8), we easily derive the classical update 
equations for the multichannel recursive least-squares (RLS): 

$$e_q(n) = y_q(n) - \hat{\mathbf{h}}_q^T(n-1)\,\mathbf{x}(n), \tag{5.17}$$

$$\hat{\mathbf{h}}_q(n) = \hat{\mathbf{h}}_q(n-1) + \mathbf{R}_{xx}^{-1}(n)\,\mathbf{x}(n)\,e_q(n). \tag{5.18}$$

Note that the Kalman gain $\mathbf{k}(n) = \mathbf{R}_{xx}^{-1}(n)\mathbf{x}(n)$ is the same for all the 
microphone signals q = 1, 2, ..., Q. This is important: even though we have Q update 
equations, the Kalman vector needs to be computed only one time per iteration. 
Using the matrix inversion lemma, we obtain the following recursive equation 
for the inverse of the covariance matrix: 

$$\mathbf{R}_{xx}^{-1}(n) = \lambda^{-1}\mathbf{R}_{xx}^{-1}(n-1) - \frac{\lambda^{-2}\,\mathbf{R}_{xx}^{-1}(n-1)\,\mathbf{x}(n)\mathbf{x}^T(n)\,\mathbf{R}_{xx}^{-1}(n-1)}{1 + \lambda^{-1}\,\mathbf{x}^T(n)\,\mathbf{R}_{xx}^{-1}(n-1)\,\mathbf{x}(n)}. \tag{5.19}$$

Another way to write the multichannel RLS is to first factorize the covariance 
matrix inverse $\mathbf{R}_{xx}^{-1}(n)$. 

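Consider the following variables: 

$$\mathbf{z}_p(n) = \sum_{j=1}^{P} \mathbf{C}_{pj}\mathbf{x}_j(n) = \mathbf{x}_p(n) + \sum_{j=1,\, j \neq p}^{P} \mathbf{C}_{pj}\mathbf{x}_j(n) = \mathbf{x}_p(n) - \hat{\mathbf{x}}_p(n), \quad p = 1, 2, \ldots, P, \tag{5.20}$$

with $\mathbf{C}_{pp} = \mathbf{I}_{L \times L}$ and $\hat{\mathbf{x}}_p(n) = -\sum_{j=1,\, j \neq p}^{P} \mathbf{C}_{pj}\mathbf{x}_j(n)$. Matrices $\mathbf{C}_{pj}$ are the 
cross-interpolators obtained by minimizing 

$$\sum_{i=0}^{n} \lambda^{n-i}\, \mathbf{z}_p^T(i)\,\mathbf{z}_p(i), \tag{5.21}$$

and $\mathbf{z}_p(n)$ are the cross-interpolation error vectors. 

A general factorization of $\mathbf{R}_{xx}^{-1}(n)$ can be stated as follows: 
Lemma 1: 

$$\mathbf{R}_{xx}^{-1}(n) = \begin{bmatrix} \mathbf{R}_1^{-1}(n) & \mathbf{0}_{L \times L} & \cdots & \mathbf{0}_{L \times L} \\ \mathbf{0}_{L \times L} & \mathbf{R}_2^{-1}(n) & \cdots & \mathbf{0}_{L \times L} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_{L \times L} & \mathbf{0}_{L \times L} & \cdots & \mathbf{R}_P^{-1}(n) \end{bmatrix} \begin{bmatrix} \mathbf{I}_{L \times L} & \mathbf{C}_{12}(n) & \cdots & \mathbf{C}_{1P}(n) \\ \mathbf{C}_{21}(n) & \mathbf{I}_{L \times L} & \cdots & \mathbf{C}_{2P}(n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{C}_{P1}(n) & \mathbf{C}_{P2}(n) & \cdots & \mathbf{I}_{L \times L} \end{bmatrix}, \tag{5.22}$$

where 

$$\mathbf{R}_p(n) = \sum_{j=1}^{P} \mathbf{C}_{pj}(n)\,\mathbf{R}_{jp}(n), \quad p = 1, 2, \ldots, P. \tag{5.23}$$

Proof: The proof is rather straightforward by multiplying both sides of (5.22) 
by $\mathbf{R}_{xx}(n)$ and showing that the result of the right-hand side is equal to the 
identity matrix with the help of (5.21). 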

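Example: P = 2. In this case, we have: 

$$\mathbf{z}_1(n) = \mathbf{x}_1(n) + \mathbf{C}_{12}\mathbf{x}_2(n), \tag{5.24}$$

$$\mathbf{z}_2(n) = \mathbf{x}_2(n) + \mathbf{C}_{21}\mathbf{x}_1(n), \tag{5.25}$$

where 

$$\mathbf{C}_{12}(n) = -\mathbf{R}_{12}(n)\,\mathbf{R}_{22}^{-1}(n), \tag{5.26}$$

$$\mathbf{C}_{21}(n) = -\mathbf{R}_{21}(n)\,\mathbf{R}_{11}^{-1}(n), \tag{5.27}$$

are the cross-interpolators obtained by minimizing $\sum_{i=0}^{n} \lambda^{n-i}\mathbf{z}_1^T(i)\mathbf{z}_1(i)$ and 
$\sum_{i=0}^{n} \lambda^{n-i}\mathbf{z}_2^T(i)\mathbf{z}_2(i)$. Hence: 

$$\mathbf{R}_{xx}^{-1}(n) = \begin{bmatrix} \mathbf{R}_1^{-1}(n) & \mathbf{0}_{L \times L} \\ \mathbf{0}_{L \times L} & \mathbf{R}_2^{-1}(n) \end{bmatrix} \begin{bmatrix} \mathbf{I}_{L \times L} & -\mathbf{R}_{12}(n)\mathbf{R}_{22}^{-1}(n) \\ -\mathbf{R}_{21}(n)\mathbf{R}_{11}^{-1}(n) & \mathbf{I}_{L \times L} \end{bmatrix}, \tag{5.28}$$

where 

$$\mathbf{R}_1(n) = \mathbf{R}_{11}(n) - \mathbf{R}_{12}(n)\,\mathbf{R}_{22}^{-1}(n)\,\mathbf{R}_{21}(n), \tag{5.29}$$

$$\mathbf{R}_2(n) = \mathbf{R}_{22}(n) - \mathbf{R}_{21}(n)\,\mathbf{R}_{11}^{-1}(n)\,\mathbf{R}_{12}(n), \tag{5.30}$$

are the cross-interpolation error energy matrices or the Schur complements of 
$\mathbf{R}_{xx}(n)$ with respect to $\mathbf{R}_{22}(n)$ and $\mathbf{R}_{11}(n)$. 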
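
The P = 2 factorization can be verified by hand in the simplest possible setting, L = 1, where every block is a scalar. The sketch below is only a numerical check under that assumption; the function names and the test matrix are made up for illustration.

```python
def inverse_via_lemma1(r11, r12, r21, r22):
    """Inverse of [[r11, r12], [r21, r22]] built from (5.26)-(5.30) with L = 1."""
    c12 = -r12 / r22            # cross-interpolator (5.26)
    c21 = -r21 / r11            # cross-interpolator (5.27)
    r1 = r11 + c12 * r21        # Schur complement (5.29)
    r2 = r22 + c21 * r12        # Schur complement (5.30)
    # diag(1/r1, 1/r2) times [[1, c12], [c21, 1]], as in (5.28)
    return [[1.0 / r1, c12 / r1],
            [c21 / r2, 1.0 / r2]]

def matmul2(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]
```

For the covariance [[2, 1], [1, 3]] the factorized form gives [[0.6, -0.2], [-0.2, 0.4]], and multiplying it by the covariance returns the identity matrix to within rounding.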

From the above result (Lemma 1), we deduce the factorized multichannel 
RLS: 

$$\hat{\mathbf{h}}_{pq}(n) = \hat{\mathbf{h}}_{pq}(n-1) + \mathbf{R}_p^{-1}(n)\,\mathbf{z}_p(n)\,e_q(n), \tag{5.31}$$

p = 1, 2, ..., P, q = 1, 2, ..., Q. 
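
As an illustration of the classical (non-factorized) update (5.17)-(5.19), here is a minimal sketch for one microphone with P = 2 and L = 2. Several choices are assumptions made only for the demonstration: the two inputs are independent random sequences (so the nonuniqueness problem does not arise), the data are noiseless, the inverse covariance is initialized to 1000 times the identity, and all names are hypothetical.

```python
import random

def rls_step(h, r_inv, x, y, lam=1.0):
    """One RLS iteration for one microphone; r_inv can be shared by all microphones."""
    size = len(x)
    u = [sum(r_inv[i][j] * x[j] for j in range(size)) for i in range(size)]
    denom = lam + sum(xi * ui for xi, ui in zip(x, u))
    # (5.19), using the symmetry of r_inv so that R^{-1} x x^T R^{-1} = u u^T.
    r_inv = [[(r_inv[i][j] - u[i] * u[j] / denom) / lam for j in range(size)]
             for i in range(size)]
    e = y - sum(hi * xi for hi, xi in zip(h, x))                             # (5.17)
    k = [sum(r_inv[i][j] * x[j] for j in range(size)) for i in range(size)]  # Kalman gain
    h = [hi + ki * e for hi, ki in zip(h, k)]                                # (5.18)
    return h, r_inv, e

def identify(num_samples=500, seed=1):
    """Identify a made-up 2-channel, 2-tap echo path from noiseless data."""
    rng = random.Random(seed)
    h_true = [0.5, -0.3, 0.2, 0.1]          # stacked as [h_1; h_2]
    x1 = [rng.uniform(-1, 1) for _ in range(num_samples)]
    x2 = [rng.uniform(-1, 1) for _ in range(num_samples)]
    h = [0.0] * 4
    r_inv = [[1000.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for n in range(num_samples):
        x = [x1[n], x1[n - 1] if n else 0.0, x2[n], x2[n - 1] if n else 0.0]
        y = sum(a * b for a, b in zip(h_true, x))
        h, r_inv, _ = rls_step(h, r_inv, x, y)
    return h_true, h
```

With several microphones, only the error (5.17) and the filter update (5.18) would be repeated per microphone; the inverse covariance and the Kalman gain are computed once per iteration, as noted above.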

4. THE MULTICHANNEL FAST RLS 

Because RLS has so far proven to perform better than other algorithms in the 
MCAEC application [15], a fast calculation scheme of a multichannel version 
is presented in this section. Compared to standard RLS it has a much lower 
complexity, $6P^2L + 2PL$ multiplications [instead of $O(P^2L^2)$] for one system 
output. This algorithm is a numerically stabilized version of the algorithm 
proposed in [16]. Some extra stability control has to be added so that the 
algorithm behaves well for a non-stationary speech signal. The following has 
to be defined: 



$$\boldsymbol{\chi}(n) = \left[\, x_1(n) \;\; x_2(n) \;\; \cdots \;\; x_P(n) \,\right]^T, \quad (P \times 1), \tag{5.32}$$

$$\mathbf{x}(n) = \left[\, \boldsymbol{\chi}^T(n) \;\; \boldsymbol{\chi}^T(n-1) \;\; \cdots \;\; \boldsymbol{\chi}^T(n-L+1) \,\right]^T, \quad (PL \times 1), \tag{5.33}$$

$$\hat{\mathbf{h}}_q(n) = \left[\, \hat{h}_{1q,0}(n) \;\; \hat{h}_{2q,0}(n) \;\; \cdots \;\; \hat{h}_{(P-1)q,L-1}(n) \;\; \hat{h}_{Pq,L-1}(n) \,\right]^T, \quad (PL \times 1). \tag{5.34}$$

Note that the channels of the filter and state-vector [$\mathbf{x}(n)$] are interleaved in this 
algorithm. Define also: 

■ $\mathbf{A}(n)$, $\mathbf{B}(n)$ = Forward and backward prediction filter matrices, (PL x P), 

■ $\mathbf{E}_A(n)$, $\mathbf{E}_B(n)$ = Forward and backward prediction error energy matrices, (P x P), 

■ $\mathbf{e}_A(n)$, $\mathbf{e}_B(n)$ = Forward and backward prediction error vectors, (P x 1), 

■ $\mathbf{k}'(n) = \mathbf{R}_{xx}^{-1}(n-1)\mathbf{x}(n)$ = a priori Kalman vector, (PL x 1), 

■ $\varphi(n)$ = Maximum likelihood related variable, (1 x 1), 

■ $\kappa \in [1.5, 2.5]$, Stabilization parameter, (1 x 1), 

■ $\lambda \in (0, 1]$, Forgetting factor, (1 x 1). 

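The multichannel fast RLS (FRLS) is then: 

Prediction: 

$$\begin{aligned}
\mathbf{e}_A(n) &= \boldsymbol{\chi}(n) - \mathbf{A}^T(n-1)\,\mathbf{x}(n-1), && (P \times 1), \\
\varphi_1(n) &= \varphi(n-1) + \mathbf{e}_A^T(n)\,\mathbf{E}_A^{-1}(n-1)\,\mathbf{e}_A(n), && (1 \times 1), \\
\begin{bmatrix} \mathbf{t}(n) \\ \mathbf{m}(n) \end{bmatrix} &= \begin{bmatrix} \mathbf{0}_{P \times 1} \\ \mathbf{k}'(n-1) \end{bmatrix} + \begin{bmatrix} \mathbf{I}_{P \times P} \\ -\mathbf{A}(n-1) \end{bmatrix} \mathbf{E}_A^{-1}(n-1)\,\mathbf{e}_A(n), && ((PL+P) \times 1), \\
\mathbf{E}_A(n) &= \lambda\left[\mathbf{E}_A(n-1) + \mathbf{e}_A(n)\mathbf{e}_A^T(n)/\varphi(n-1)\right], && (P \times P), \\
\mathbf{A}(n) &= \mathbf{A}(n-1) + \mathbf{k}'(n-1)\,\mathbf{e}_A^T(n)/\varphi(n-1), && (PL \times P), \\
\mathbf{e}_{B_1}(n) &= \mathbf{E}_B(n-1)\,\mathbf{m}(n), && (P \times 1), \\
\mathbf{e}_{B_2}(n) &= \boldsymbol{\chi}(n-L) - \mathbf{B}^T(n-1)\,\mathbf{x}(n), && (P \times 1), \\
\mathbf{e}_B(n) &= \kappa\,\mathbf{e}_{B_2}(n) + (1-\kappa)\,\mathbf{e}_{B_1}(n), && (P \times 1), \\
\mathbf{k}'(n) &= \mathbf{t}(n) + \mathbf{B}(n-1)\,\mathbf{m}(n), && (PL \times 1), \\
\varphi(n) &= \varphi_1(n) - \mathbf{e}_{B_2}^T(n)\,\mathbf{m}(n), && (1 \times 1), \\
\mathbf{E}_B(n) &= \lambda\left[\mathbf{E}_B(n-1) + \mathbf{e}_{B_2}(n)\mathbf{e}_{B_2}^T(n)/\varphi(n)\right], && (P \times P), \\
\mathbf{B}(n) &= \mathbf{B}(n-1) + \mathbf{k}'(n)\,\mathbf{e}_B^T(n)/\varphi(n), && (PL \times P).
\end{aligned}$$

Filtering: 

$$\begin{aligned}
e_q(n) &= y_q(n) - \hat{\mathbf{h}}_q^T(n-1)\,\mathbf{x}(n), && (1 \times 1), \\
\hat{\mathbf{h}}_q(n) &= \hat{\mathbf{h}}_q(n-1) + \mathbf{k}'(n)\,e_q(n)/\varphi(n), && (PL \times 1).
\end{aligned}$$

5. THE MULTICHANNEL LMS ALGORITHM 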

We derive two different versions of the multichannel LMS algorithm. The 
first one is straightforward and is a simple generalization of the single-channel 
LMS. The second one is more sophisticated and takes into account the cross- 
correlation among all the channels. 

5.1 CLASSICAL DERIVATION 

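The mean-square error criterion is defined as 

$$J_{\mathrm{MS},q} = E\left\{\left[y_q(n) - \mathbf{x}^T(n)\hat{\mathbf{h}}_q\right]^2\right\}, \tag{5.35}$$

where $E\{\cdot\}$ denotes mathematical expectation. Let $\mathbf{f}(\hat{\mathbf{h}}_q)$ denote the value 
of the gradient vector with respect to $\hat{\mathbf{h}}_q$. According to the steepest-descent 
method, the updated value of $\hat{\mathbf{h}}_q$ at time n is computed by using the simple 
recursive relation [17]: 

$$\hat{\mathbf{h}}_q(n) = \hat{\mathbf{h}}_q(n-1) + \frac{\mu}{2}\left\{-\mathbf{f}\left[\hat{\mathbf{h}}_q(n-1)\right]\right\}, \tag{5.36}$$

where $\mu$ is a positive step-size constant. Differentiating (5.35) with respect to the 
filter, we get the following value for the gradient vector: 

$$\mathbf{f}(\hat{\mathbf{h}}_q) = \left[\, \mathbf{f}_1^T(\hat{\mathbf{h}}_q) \;\; \mathbf{f}_2^T(\hat{\mathbf{h}}_q) \;\; \cdots \;\; \mathbf{f}_P^T(\hat{\mathbf{h}}_q) \,\right]^T = \frac{\partial J_{\mathrm{MS},q}}{\partial \hat{\mathbf{h}}_q} = -2\,\mathbf{r}_{xy,q} + 2\,\mathbf{R}_{xx}\hat{\mathbf{h}}_q, \tag{5.37}$$

with $\mathbf{r}_{xy,q} = E\{y_q(n)\mathbf{x}(n)\}$ and $\mathbf{R}_{xx} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$. By taking 
$\mathbf{f}(\hat{\mathbf{h}}_q) = \mathbf{0}_{LP \times 1}$, we obtain the Wiener-Hopf equations 

$$\mathbf{R}_{xx}\hat{\mathbf{h}}_q = \mathbf{r}_{xy,q}, \tag{5.38}$$

which are similar to the normal equations (5.8) that were derived from a 
weighted least-squares criterion (5.4). Note that we use the same notation 
for similar variables that are derived either from the Wiener-Hopf equations or 
the normal equations. 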

The steepest-descent algorithm is now:

\[ \mathbf{h}_q(n) = \mathbf{h}_q(n-1) + \mu E\{\mathbf{x}(n)e_q(n)\}, \tag{5.39} \]

and the classical stochastic approximation (consisting of approximating the
gradient with its instantaneous value) [17] provides the multichannel LMS
algorithm:

\[ \mathbf{h}_q(n) = \mathbf{h}_q(n-1) + \mu\,\mathbf{x}(n)e_q(n), \tag{5.40} \]

of which the classical mean weight convergence condition under appropriate
independence assumption is:

\[ 0 < \mu < \frac{2}{L\sum_{p=1}^{P}\sigma_{x_p}^2}, \tag{5.41} \]

where the $\sigma_{x_p}^2$ ($p = 1, 2, \ldots, P$) are the powers of the input signals. When this
condition is satisfied, the weight vector converges in the mean to the optimal
Wiener-Hopf solution.
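
As an illustration only, the recursion (5.40) and the bound (5.41) can be sketched in a few lines of NumPy. The function names, the toy white-noise inputs and the randomly drawn impulse responses are assumptions made for this sketch and are not taken from the text; the input powers in (5.41) are replaced by sample estimates.

```python
import numpy as np

def stacked_regressor(x, n, L):
    """x(n) = [x_1^T(n) ... x_P^T(n)]^T with x_p(n) = [x_p(n), ..., x_p(n-L+1)]^T."""
    return np.concatenate([row[n - L + 1:n + 1][::-1] for row in x])

def multichannel_lms(x, y, L, mu):
    """Run (5.40) over all samples; x has shape (P, T), y has shape (T,)."""
    h = np.zeros(x.shape[0] * L)
    for n in range(L - 1, x.shape[1]):
        xn = stacked_regressor(x, n, L)
        e = y[n] - xn @ h            # a priori error e_q(n)
        h = h + mu * xn * e          # stochastic-gradient update
    return h

def mean_convergence_bound(x, L):
    """Upper limit of (5.41) with sample estimates of the input powers."""
    return 2.0 / (L * np.sum(np.mean(x ** 2, axis=1)))

rng = np.random.default_rng(0)
P, L, T = 2, 4, 5000
x = rng.standard_normal((P, T))                 # uncorrelated toy inputs (assumption)
h_true = rng.standard_normal(P * L)             # hypothetical echo paths
y = sum(np.convolve(x[p], h_true[p * L:(p + 1) * L])[:T] for p in range(P))

mu = 0.1 * mean_convergence_bound(x, L)         # well inside the bound
h_hat = multichannel_lms(x, y, L, mu)
print(np.linalg.norm(h_hat - h_true) / np.linalg.norm(h_true))
```

The inputs here are deliberately uncorrelated, which is the easy case; with strongly correlated channels the same loop converges much more slowly, which is the situation the improved version below addresses.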

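However, the gradient vector corresponding to the filter $\mathbf{h}_{pq}$ is:

\[ \mathbf{f}_p(\mathbf{h}_q) = -2\left[ \mathbf{r}_{pq} - \sum_{j=1}^{P}\mathbf{R}_{pj}\mathbf{h}_{jq} \right], \quad p = 1, 2, \ldots, P, \tag{5.42} \]

with $\mathbf{r}_{pq} = E\{y_q(n)\mathbf{x}_p(n)\}$ and $\mathbf{R}_{pj} = E\{\mathbf{x}_p(n)\mathbf{x}_j^T(n)\}$,
which clearly shows some dependency of $\mathbf{f}_p$ on the full vector $\mathbf{h}_q$. In other
words, the filters $\mathbf{h}_{jq}$ with $j \neq p$ influence, in a bad direction, the gradient
vector $\mathbf{f}_p$ when seeking the minimum, because the algorithm does not take the
cross-correlation among all the inputs into account.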



5.2 IMPROVED VERSION 

We have seen that during the convergence of the multichannel LMS algorithm,
each adaptive filter depends on the others. This dependency must be
taken into account. By using this information and Lemma 1, we now differentiate
criterion (5.35) with respect to the tap-weight in a different way. The new
gradient is obtained by writing that $\mathbf{h}_{pq}$ depends on the full vector $\mathbf{h}_q$. We get:

\[ \mathbf{f}_p(\mathbf{h}_{pq}) = \frac{\partial J_{\mathrm{MS},q}}{\partial \mathbf{h}_{pq}} = -2E\left\{ \mathbf{z}_p(n)\left[ y_q(n) - \mathbf{x}^T(n)\mathbf{h}_q \right] \right\}, \quad p = 1, 2, \ldots, P, \tag{5.43} \]

with

\[ \mathbf{z}_p(n) = \sum_{j=1}^{P}\mathbf{C}_{pj}\mathbf{x}_j(n), \quad p = 1, 2, \ldots, P. \tag{5.44} \]

We have some interesting orthogonality and decorrelation properties. 
Lemma 2: 

E{y^{n)zj{n)} = 0, (5.45) 

\[ E\{\mathbf{z}_p(n)\mathbf{x}_j^T(n)\} = \mathbf{0}_{L\times L}, \quad \forall p, j = 1, 2, \ldots, P, \; p \neq j. \tag{5.46} \]
Proof: The proof is straightforward from Lemma 1 (using mathematical
expectation instead of weighted least-squares).

We can verify by using Lemma 2 that each gradient vector $\mathbf{f}_p$ ($p = 1, 2, \ldots, P$)
now depends only on the corresponding filter $\mathbf{h}_{pq}$. In other words, we make the
convergence of each $\mathbf{h}_{pq}$ independent of the others, which is not the case in the
classical gradient algorithm.

Based on the above gradient vector, the improved steepest-descent algorithm 
is easily obtained, out of which a stochastic approximation leads to the improved 
multichannel LMS algorithm: 

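\[ \mathbf{h}_q(n) = \mathbf{h}_q(n-1) + \mu\,\mathbf{z}(n)e_q(n), \tag{5.47} \]

with

\[ \mathbf{z}(n) = \left[ \mathbf{z}_1^T(n) \;\; \mathbf{z}_2^T(n) \;\; \cdots \;\; \mathbf{z}_P^T(n) \right]^T \]

and

\[ 0 < \mu < \frac{2}{L\sum_{p=1}^{P}\sigma_{z_p}^2} \tag{5.48} \]

to guarantee the convergence of the algorithm. Note that the improved
multichannel LMS algorithm can be seen as an approximation of the factorized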
multichannel RLS algorithm by taking R~ ^ (n) ks 



6. THE MULTICHANNEL APA 

The affine projection algorithm (APA) [18] has become popular because of 
its lower complexity compared to RLS while it converges almost as fast in the 
single-channel case. Therefore it is interesting to derive and study the multi- 
channel version of this algorithm. Like the multichannel LMS, two versions 
are derived. 



6.1 THE STRAIGHTFORWARD MULTICHANNEL
APA

A simple trick for obtaining the single-channel APA is to search for an algorithm
of the stochastic gradient type cancelling $N$ a posteriori errors [19]. This
requirement results in an underdetermined set of linear equations of which the
minimum-norm solution is chosen. In the following, this technique is extended
to the multichannel case [20].

By definition, the sets of $N$ a priori errors and $N$ a posteriori errors are:

\[ \mathbf{e}_q(n) = \mathbf{y}_q(n) - \mathbf{X}^T(n)\mathbf{h}_q(n-1), \tag{5.49} \]

\[ \mathbf{e}_{a,q}(n) = \mathbf{y}_q(n) - \mathbf{X}^T(n)\mathbf{h}_q(n), \tag{5.50} \]
where

\[ \mathbf{X}(n) = \left[ \mathbf{X}_1^T(n) \;\; \mathbf{X}_2^T(n) \;\; \cdots \;\; \mathbf{X}_P^T(n) \right]^T \]

is a matrix of size $PL \times N$; the $L \times N$ matrix

\[ \mathbf{X}_p(n) = \left[ \mathbf{x}_p(n) \;\; \mathbf{x}_p(n-1) \;\; \cdots \;\; \mathbf{x}_p(n-N+1) \right] \]

is made from the $N$ last input vectors $\mathbf{x}_p(n)$; finally, $\mathbf{y}_q(n)$ and $\mathbf{e}_q(n)$ are
respectively vectors of the $N$ last samples of the reference signal $y_q(n)$ and
error signal $e_q(n)$.

Using (5.49) and (5.50) plus the requirement that $\mathbf{e}_{a,q}(n) = \mathbf{0}_{N\times 1}$, we obtain:

\[ \mathbf{X}^T(n)\Delta\mathbf{h}_q(n) = \mathbf{e}_q(n), \tag{5.51} \]

where $\Delta\mathbf{h}_q(n) = \mathbf{h}_q(n) - \mathbf{h}_q(n-1)$.

Equation (5.51) ($N$ equations in $PL$ unknowns, $N < PL$) is an underdetermined
set of linear equations. Hence, it has an infinite number of solutions,
out of which the minimum-norm solution is chosen. This results in [20], [21]:

\[ \mathbf{h}_q(n) = \mathbf{h}_q(n-1) + \mathbf{X}(n)\left[ \mathbf{X}^T(n)\mathbf{X}(n) \right]^{-1}\mathbf{e}_q(n). \tag{5.52} \]

However, in this straightforward APA, the normalization matrix
$\mathbf{X}^T(n)\mathbf{X}(n) = \sum_{p=1}^{P}\mathbf{X}_p^T(n)\mathbf{X}_p(n)$ does not involve the cross-correlation
elements of the $P$ input signals [namely $\mathbf{X}_i^T(n)\mathbf{X}_p(n)$, $i, p = 1, 2, \ldots, P$, $i \neq p$]
and this algorithm may converge slowly.
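
A minimal numerical sketch of one update of (5.52) is given below, assuming the stacked data matrix and the reference vector are already available as NumPy arrays. The function name and the random test data are hypothetical; the sketch only checks the two properties used in the derivation, namely that the $N$ a posteriori errors vanish and that the increment is the minimum-norm solution of (5.51).

```python
import numpy as np

def apa_step(h, X, y_vec):
    """One update of (5.52). X is PL x N, y_vec holds the N last reference samples."""
    e = y_vec - X.T @ h                         # a priori errors, (5.49)
    dh = X @ np.linalg.solve(X.T @ X, e)        # minimum-norm solution of (5.51)
    return h + dh, e

rng = np.random.default_rng(1)
PL, N = 6, 3
X = rng.standard_normal((PL, N))                # stand-in for X(n)
y_vec = rng.standard_normal(N)                  # stand-in for y_q(n)
h_old = rng.standard_normal(PL)

h_new, e_prior = apa_step(h_old, X, y_vec)
e_post = y_vec - X.T @ h_new                    # a posteriori errors, (5.50)
print(np.max(np.abs(e_post)))
```

Practical implementations usually add a small regularization term to $\mathbf{X}^T(n)\mathbf{X}(n)$ before solving; it is left out here so that the a posteriori errors are cancelled up to rounding.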

6.2 THE IMPROVED TWO-CHANNEL APA 

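A simple way to improve the previous adaptive algorithm is to use the orthogonality
and decorrelation properties, which will be shown later to appear in this
context. Let us derive the improved algorithm by requiring a condition similar
to the one used in the improved multichannel LMS. Just use the constraint that
$\Delta\mathbf{h}_{pq}$ be orthogonal to $\mathbf{X}_j$, $j \neq p$. As a result, we take into account separately
the contributions of each input signal. These constraints read:

\[ \mathbf{X}_2^T(n)\Delta\mathbf{h}_{1q}(n) = \mathbf{0}_{N\times 1}, \tag{5.53} \]

\[ \mathbf{X}_1^T(n)\Delta\mathbf{h}_{2q}(n) = \mathbf{0}_{N\times 1}, \tag{5.54} \]

and the new set of linear equations characterizing the improved two-channel
APA is:

\[ \begin{bmatrix} \mathbf{X}_1^T(n) & \mathbf{X}_2^T(n) \\ \mathbf{X}_2^T(n) & \mathbf{0}_{N\times L} \\ \mathbf{0}_{N\times L} & \mathbf{X}_1^T(n) \end{bmatrix} \begin{bmatrix} \Delta\mathbf{h}_{1q}(n) \\ \Delta\mathbf{h}_{2q}(n) \end{bmatrix} = \begin{bmatrix} \mathbf{e}_q(n) \\ \mathbf{0}_{N\times 1} \\ \mathbf{0}_{N\times 1} \end{bmatrix}. \tag{5.55} \]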
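The improved two-channel APA algorithm is given by the minimum-norm
solution of (5.55) which is found as [20],

\[ \Delta\mathbf{h}_{1q}(n) = \mathbf{Z}_1(n)\left[ \mathbf{Z}_1^T(n)\mathbf{Z}_1(n) + \mathbf{Z}_2^T(n)\mathbf{Z}_2(n) \right]^{-1}\mathbf{e}_q(n), \tag{5.56} \]

\[ \Delta\mathbf{h}_{2q}(n) = \mathbf{Z}_2(n)\left[ \mathbf{Z}_1^T(n)\mathbf{Z}_1(n) + \mathbf{Z}_2^T(n)\mathbf{Z}_2(n) \right]^{-1}\mathbf{e}_q(n), \tag{5.57} \]

where $\mathbf{Z}_p(n)$ is the projection of $\mathbf{X}_p(n)$ onto a subspace orthogonal to
$\mathbf{X}_j(n)$,

\[ \mathbf{Z}_p(n) = \left\{ \mathbf{I}_{L\times L} - \mathbf{X}_j(n)\left[ \mathbf{X}_j^T(n)\mathbf{X}_j(n) \right]^{-1}\mathbf{X}_j^T(n) \right\}\mathbf{X}_p(n), \quad p, j = 1, 2, \; p \neq j. \tag{5.58} \]

This results in the following orthogonality conditions,

\[ \mathbf{X}_j^T(n)\mathbf{Z}_p(n) = \mathbf{0}_{N\times N}, \quad p \neq j, \tag{5.59} \]

which are similar to what appears in the improved multichannel LMS (Lemma
2).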
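
The two-channel update (5.56)-(5.58) can be checked numerically with a short sketch. The helper names and the random data are assumptions for illustration; the projection is computed with a linear solve rather than an explicit inverse, which is a common numerical choice and not something prescribed by the text.

```python
import numpy as np

def project_out(Xp, Xj):
    """Z_p of (5.58): the part of X_p orthogonal to the columns of X_j."""
    return Xp - Xj @ np.linalg.solve(Xj.T @ Xj, Xj.T @ Xp)

def improved_apa2_step(h1, h2, X1, X2, y_vec):
    """One update of (5.56)-(5.57) for two channels."""
    e = y_vec - X1.T @ h1 - X2.T @ h2           # a priori errors
    Z1, Z2 = project_out(X1, X2), project_out(X2, X1)
    g = np.linalg.solve(Z1.T @ Z1 + Z2.T @ Z2, e)
    return h1 + Z1 @ g, h2 + Z2 @ g

rng = np.random.default_rng(2)
L, N = 8, 2
X1, X2 = rng.standard_normal((L, N)), rng.standard_normal((L, N))
y_vec = rng.standard_normal(N)
h1, h2 = rng.standard_normal(L), rng.standard_normal(L)

h1_new, h2_new = improved_apa2_step(h1, h2, X1, X2, y_vec)
print(np.max(np.abs(X2.T @ (h1_new - h1))))     # constraint (5.53), close to zero
```

Because each $\mathbf{Z}_p(n)$ is an orthogonal projection of $\mathbf{X}_p(n)$, one has $\mathbf{X}_p^T(n)\mathbf{Z}_p(n) = \mathbf{Z}_p^T(n)\mathbf{Z}_p(n)$, which is why the same bracketed matrix cancels the a posteriori errors in both (5.56) and (5.57).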

6.3 THE IMPROVED MULTICHANNEL APA 

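The algorithm explained for two channels is easily generalized to an arbitrary
number of channels $P$. Define the following matrix of size $L \times (P-1)N$:

\[ \tilde{\mathbf{X}}_p(n) = \left[ \mathbf{X}_1(n) \;\; \cdots \;\; \mathbf{X}_{p-1}(n) \;\; \mathbf{X}_{p+1}(n) \;\; \cdots \;\; \mathbf{X}_P(n) \right], \quad p = 1, 2, \ldots, P. \]

The $P$ orthogonality constraints are:

\[ \tilde{\mathbf{X}}_p^T(n)\Delta\mathbf{h}_{pq}(n) = \mathbf{0}_{(P-1)N\times 1}, \quad p = 1, 2, \ldots, P, \tag{5.60} \]

and by using the same steps as for $P = 2$, a solution similar to (5.56), (5.57) is
obtained [20]:

\[ \Delta\mathbf{h}_{pq}(n) = \mathbf{Z}_p(n)\left[ \sum_{j=1}^{P}\mathbf{Z}_j^T(n)\mathbf{Z}_j(n) \right]^{-1}\mathbf{e}_q(n), \quad p = 1, 2, \ldots, P, \tag{5.61} \]

where $\mathbf{Z}_p(n)$ is the projection of $\mathbf{X}_p(n)$ onto a subspace orthogonal to $\tilde{\mathbf{X}}_p(n)$,
i.e.,

\[ \mathbf{Z}_p(n) = \left\{ \mathbf{I}_{L\times L} - \tilde{\mathbf{X}}_p(n)\left[ \tilde{\mathbf{X}}_p^T(n)\tilde{\mathbf{X}}_p(n) \right]^{-1}\tilde{\mathbf{X}}_p^T(n) \right\}\mathbf{X}_p(n), \quad p = 1, 2, \ldots, P. \tag{5.62} \]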
Note that this equation holds only under the condition $L > (P-1)N$, so that
the matrix that appears in (5.62) be invertible.

We can easily see that:

\[ \tilde{\mathbf{X}}_p^T(n)\mathbf{Z}_p(n) = \mathbf{0}_{(P-1)N\times N}, \quad p = 1, 2, \ldots, P. \tag{5.63} \]

Fast versions of the single- and multi-channel APA can be derived [22], [23],
[24].

7. THE MULTICHANNEL EXPONENTIATED 
GRADIENT ALGORITHM 

Room acoustic impulse responses are often sparse. Our interest in exponen- 
tiated adaptive algorithms is that they converge and track much faster than the 
LMS algorithm for this family of impulse responses. 

One easy way to find adaptive algorithms that adjust the new weight vector
$\mathbf{h}_q(n+1)$ from the old one $\mathbf{h}_q(n)$ is to minimize the following function [25]:

\[ J[\mathbf{h}_q(n+1)] = d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)] + \eta\,e_{a,q}^2(n+1), \tag{5.64} \]

where $d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]$ is some measure of distance from the old to the new
weight vector,

\[ e_{a,q}(n+1) = y_q(n+1) - \mathbf{h}_q^T(n+1)\mathbf{x}(n+1) \tag{5.65} \]

is the a posteriori error signal, and $\eta$ is a positive constant. (This formulation is a
generalization of the case of Euclidean distance.) The magnitude of $\eta$ represents
the importance of correctness compared to the importance of conservativeness
[25]. If $\eta$ is very small, minimizing $J[\mathbf{h}_q(n+1)]$ is close to minimizing
$d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]$, so that the algorithm makes very small updates. On the other hand,
if $\eta$ is very large, the minimization of $J[\mathbf{h}_q(n+1)]$ is almost equivalent to
minimizing $d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]$ subject to the constraint $e_{a,q}(n+1) = 0$.

To minimize $J[\mathbf{h}_q(n+1)]$, we need to set its $PL$ partial derivatives
$\partial J[\mathbf{h}_q(n+1)]/\partial h_{pq,l}(n+1)$ to zero. Hence, the different weight coefficients
$h_{pq,l}(n+1)$, $l = 0, 1, \ldots, L-1$, $p = 1, 2, \ldots, P$, will be found by solving
the equations:

\[ \frac{\partial d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]}{\partial h_{pq,l}(n+1)} - 2\eta\,x_p(n+1-l)\,e_{a,q}(n+1) = 0. \tag{5.66} \]

Solving (5.66) is in general very difficult. However, if the new weight vector
$\mathbf{h}_q(n+1)$ is close to the old weight vector $\mathbf{h}_q(n)$, replacing the a posteriori
error signal $e_{a,q}(n+1)$ in (5.66) with the a priori error signal $e_q(n+1)$ is a
reasonable approximation and the equation

\[ \frac{\partial d[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]}{\partial h_{pq,l}(n+1)} - 2\eta\,x_p(n+1-l)\,e_q(n+1) = 0 \tag{5.67} \]

is much easier to solve for all distance measures $d$.

The LMS algorithm is easily obtained from (5.67) by using the squared
Euclidean distance

\[ d_E[\mathbf{h}_q(n+1), \mathbf{h}_q(n)] = \|\mathbf{h}_q(n+1) - \mathbf{h}_q(n)\|_2^2. \tag{5.68} \]
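
As a short worked step (added for illustration), differentiating the squared Euclidean distance coefficient by coefficient gives

\[ \frac{\partial d_E[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]}{\partial h_{pq,l}(n+1)} = 2\left[ h_{pq,l}(n+1) - h_{pq,l}(n) \right], \]

so that (5.67) reduces to

\[ h_{pq,l}(n+1) = h_{pq,l}(n) + \eta\,x_p(n+1-l)\,e_q(n+1), \]

or, in vector form, $\mathbf{h}_q(n+1) = \mathbf{h}_q(n) + \eta\,\mathbf{x}(n+1)e_q(n+1)$. This has the form of (5.40), with $\eta$ playing the role of the step size.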

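The exponentiated gradient (EG) algorithm with positive weights results from
using for $d$ the relative entropy, also known as Kullback-Leibler divergence,

\[ d_{re}[\mathbf{h}_q(n+1), \mathbf{h}_q(n)] = \sum_{p=1}^{P}\sum_{l=0}^{L-1} h_{pq,l}(n+1)\ln\frac{h_{pq,l}(n+1)}{h_{pq,l}(n)}, \tag{5.69} \]

with the constraint $\sum_p\sum_l h_{pq,l}(n+1) = 1$, so that (5.67) becomes:

\[ \frac{\partial d_{re}[\mathbf{h}_q(n+1), \mathbf{h}_q(n)]}{\partial h_{pq,l}(n+1)} - 2\eta\,x_p(n+1-l)\,e_q(n+1) + \gamma = 0, \tag{5.70} \]

where $\gamma$ is the Lagrange multiplier. Actually, the appropriate constraint should
be $\sum_p\sum_l h_{pq,l}(n+1) = \sum_p\sum_l h_{pq,l}$, where $h_{pq,l}$ (without time index) denotes the
true coefficients, but $\sum_p\sum_l h_{pq,l}$ is not known in practice,
so we use the arbitrary value 1 instead.

The algorithm derived from (5.70) is:

\[ h_{pq,l}(n+1) = \frac{h_{pq,l}(n)\,r_{pq,l}(n+1)}{\sum_{i=1}^{P}\sum_{j=0}^{L-1} h_{iq,j}(n)\,r_{iq,j}(n+1)}, \tag{5.71} \]

where

\[ r_{pq,l}(n+1) = \exp\left[ 2\eta\,x_p(n+1-l)\,e_q(n+1) \right]. \tag{5.72} \]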

This algorithm is valid only for positive coefficients. To deal with both
positive and negative coefficients, we can always find two vectors $\mathbf{h}_q^{+}(n+1)$
and $\mathbf{h}_q^{-}(n+1)$ with positive coefficients, in such a way that the vector

\[ \mathbf{h}_q(n+1) = \mathbf{h}_q^{+}(n+1) - \mathbf{h}_q^{-}(n+1) \tag{5.73} \]

can have positive and negative components. In this case, the a posteriori error
signal can be written as:

\[ e_{a,q}(n+1) = y_q(n+1) - \left[ \mathbf{h}_q^{+}(n+1) - \mathbf{h}_q^{-}(n+1) \right]^T\mathbf{x}(n+1) \tag{5.74} \]
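and the function (5.64) will change to:

\[ J[\mathbf{h}_q^{+}(n+1), \mathbf{h}_q^{-}(n+1)] = d[\mathbf{h}_q^{+}(n+1), \mathbf{h}_q^{+}(n)] + d[\mathbf{h}_q^{-}(n+1), \mathbf{h}_q^{-}(n)] + \frac{\eta}{u}\,e_{a,q}^2(n+1), \tag{5.75} \]

where $u$ is a positive scaling constant. Using the same approximation
as before and choosing the Kullback-Leibler divergence plus the constraint
$\sum_p\sum_l \left[ h_{pq,l}^{+}(n+1) + h_{pq,l}^{-}(n+1) \right] = u$, the solutions of the equations

\[ \frac{\partial d_{re}[\mathbf{h}_q^{+}(n+1), \mathbf{h}_q^{+}(n)]}{\partial h_{pq,l}^{+}(n+1)} - \frac{2\eta}{u}\,x_p(n+1-l)\,e_q(n+1) + \gamma = 0, \tag{5.76} \]

\[ \frac{\partial d_{re}[\mathbf{h}_q^{-}(n+1), \mathbf{h}_q^{-}(n)]}{\partial h_{pq,l}^{-}(n+1)} + \frac{2\eta}{u}\,x_p(n+1-l)\,e_q(n+1) + \gamma = 0, \tag{5.77} \]

give the so-called EG± algorithm:

\[ h_{pq,l}^{+}(n+1) = u\,\frac{h_{pq,l}^{+}(n)\,r_{pq,l}^{+}(n+1)}{s_q(n+1)}, \tag{5.78} \]

\[ h_{pq,l}^{-}(n+1) = u\,\frac{h_{pq,l}^{-}(n)\,r_{pq,l}^{-}(n+1)}{s_q(n+1)}, \tag{5.79} \]

where

\[ s_q(n+1) = \sum_{i=1}^{P}\sum_{j=0}^{L-1}\left[ h_{iq,j}^{+}(n)\,r_{iq,j}^{+}(n+1) + h_{iq,j}^{-}(n)\,r_{iq,j}^{-}(n+1) \right], \tag{5.80} \]

\[ r_{pq,l}^{+}(n+1) = \exp\left[ \frac{2\eta}{u}\,x_p(n+1-l)\,e_q(n+1) \right], \tag{5.81} \]

\[ r_{pq,l}^{-}(n+1) = \exp\left[ -\frac{2\eta}{u}\,x_p(n+1-l)\,e_q(n+1) \right] = \frac{1}{r_{pq,l}^{+}(n+1)}, \tag{5.82} \]

\[ e_q(n+1) = y_q(n+1) - \left[ \mathbf{h}_q^{+}(n) - \mathbf{h}_q^{-}(n) \right]^T\mathbf{x}(n+1). \tag{5.83} \]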
We can check that we always have $\|\mathbf{h}_q^{+}(n+1)\|_1 + \|\mathbf{h}_q^{-}(n+1)\|_1 = u$. Upon
convergence:

\[ \|\mathbf{h}_q(\infty)\|_1 = \|\mathbf{h}_q\|_1 = \|\mathbf{h}_q^{+}(\infty) - \mathbf{h}_q^{-}(\infty)\|_1 \leq \|\mathbf{h}_q^{+}(\infty)\|_1 + \|\mathbf{h}_q^{-}(\infty)\|_1 = u, \tag{5.84} \]

hence, the constant $u$ should be chosen such that $u \geq \|\mathbf{h}_q\|_1$.

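A normalized version of the multichannel EG± algorithm is given below:

Initialization :

$h_{pq,l}^{+}(0) = h_{pq,l}^{-}(0) = c > 0$, $p = 1, 2, \ldots, P$, $l = 0, 1, \ldots, L-1$.

Parameters :

$u \geq \|\mathbf{h}_q\|_1$, $0 < \alpha \leq 1$, $\delta > 0$.

Error :

$e_q(n+1) = y_q(n+1) - \left[ \mathbf{h}_q^{+}(n) - \mathbf{h}_q^{-}(n) \right]^T\mathbf{x}(n+1)$.

Update :

\[ \begin{aligned}
\mu(n+1) &= \frac{\alpha}{\mathbf{x}^T(n+1)\mathbf{x}(n+1) + \delta}, \\
r_{pq,l}^{+}(n+1) &= \exp\left[ \frac{\mu(n+1)}{u}\,x_p(n+1-l)\,e_q(n+1) \right], \\
r_{pq,l}^{-}(n+1) &= \frac{1}{r_{pq,l}^{+}(n+1)}, \\
s_q(n+1) &= \sum_{i=1}^{P}\sum_{j=0}^{L-1}\left[ h_{iq,j}^{+}(n)\,r_{iq,j}^{+}(n+1) + h_{iq,j}^{-}(n)\,r_{iq,j}^{-}(n+1) \right], \\
h_{pq,l}^{+}(n+1) &= u\,\frac{h_{pq,l}^{+}(n)\,r_{pq,l}^{+}(n+1)}{s_q(n+1)}, \\
h_{pq,l}^{-}(n+1) &= u\,\frac{h_{pq,l}^{-}(n)\,r_{pq,l}^{-}(n+1)}{s_q(n+1)}, \\
& p = 1, 2, \ldots, P, \quad l = 0, 1, \ldots, L-1.
\end{aligned} \]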



Intuitively, exponentiating the update has the effect of assigning larger rel- 
ative updates to larger weights, thereby deemphasizing the effect of smaller 
weights. This is qualitatively similar to the PNLMS algorithm [26] which 
makes the update proportional to the size of the weight. This type of behavior 
is desirable for sparse impulse responses where small weights do not contribute 
significantly to the mean solution but introduce an undesirable noise-like
variance.
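
The normalized EG± recursion described in this section can be sketched as follows for a single channel ($P = 1$). This is a toy illustration under stated assumptions: the sparse impulse response, the choice $u = 2\|\mathbf{h}_q\|_1$, the initialization $c = u/(2L)$ and the parameter defaults are all hypothetical, and the function name is not from the text.

```python
import numpy as np

def eg_pm_step(h_pos, h_neg, x_vec, y, u, alpha=0.5, delta=1e-6):
    """One normalized EG+- update; h_pos and h_neg hold the two positive vectors."""
    e = y - (h_pos - h_neg) @ x_vec                 # a priori error
    mu = alpha / (x_vec @ x_vec + delta)            # normalized step size
    r_pos = np.exp(mu / u * x_vec * e)
    r_neg = 1.0 / r_pos
    s = np.sum(h_pos * r_pos + h_neg * r_neg)       # rescales the l1 budget to u
    return u * h_pos * r_pos / s, u * h_neg * r_neg / s

rng = np.random.default_rng(3)
L, T = 16, 4000
h_true = np.zeros(L)
h_true[3], h_true[9] = 1.0, -0.5                    # hypothetical sparse path
u = 2.0 * np.sum(np.abs(h_true))                    # any u >= ||h||_1 (assumption)
h_pos = np.full(L, u / (2 * L))                     # c > 0 chosen so the budget is u
h_neg = h_pos.copy()

x = rng.standard_normal(T)
for n in range(L - 1, T):
    x_vec = x[n - L + 1:n + 1][::-1]
    h_pos, h_neg = eg_pm_step(h_pos, h_neg, x_vec, h_true @ x_vec, u)

misalignment = np.linalg.norm(h_pos - h_neg - h_true) / np.linalg.norm(h_true)
print(misalignment, np.sum(h_pos) + np.sum(h_neg))
```

Because the update is multiplicative, both vectors stay positive and their combined $\ell_1$ norm stays equal to $u$ at every step, which is the invariant stated before (5.84).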

Recently, the proportionate normalized least-mean-square (PNLMS) algo- 
rithm was developed for use in network echo cancelers [26]. In comparison 
to the NLMS algorithm, PNLMS has very fast initial convergence and track- 
ing when the echo path is sparse. As previously mentioned, the idea behind 
PNLMS is to update each coefficient of the filter independently of the others 
by adjusting the adaptation step size in proportion to the estimated filter coeffi- 
cient. More recently, an improved PNLMS (IPNLMS) [27] was proposed that 
performs better than NLMS and PNLMS, whatever the nature of the impulse 
response is. The IPNLMS is summarized below: 



Initialization :

$h_{pq,l}(0) = 0$, $p = 1, 2, \ldots, P$, $l = 0, 1, \ldots, L-1$.

Parameters :

$0 < \alpha \leq 1$, $\delta_{\mathrm{IPNLMS}} > 0$,

$-1 \leq \kappa < 1$,

$\varepsilon > 0$ (small number to avoid division by zero).

Error :

$e_q(n+1) = y_q(n+1) - \mathbf{h}_q^T(n)\mathbf{x}(n+1)$.

Update :

\[ \begin{aligned}
g_{pq,l}(n) &= \frac{1-\kappa}{2PL} + (1+\kappa)\frac{|h_{pq,l}(n)|}{2\|\mathbf{h}_q(n)\|_1 + \varepsilon}, \quad p = 1, 2, \ldots, P, \; l = 0, 1, \ldots, L-1, \\
\mu(n+1) &= \frac{\alpha}{\sum_{i=1}^{P}\sum_{j=0}^{L-1} x_i^2(n+1-j)\,g_{iq,j}(n) + \delta_{\mathrm{IPNLMS}}}, \\
h_{pq,l}(n+1) &= h_{pq,l}(n) + \mu(n+1)\,g_{pq,l}(n)\,x_p(n+1-l)\,e_q(n+1), \quad p = 1, 2, \ldots, P, \; l = 0, 1, \ldots, L-1.
\end{aligned} \]

In general, $g_{pq,l}$ in the IPNLMS provides the “proportionate” scaling of the
update. The parameter $\kappa$ controls the amount of proportionality in the update.
For $\kappa = -1$, it can easily be checked that the IPNLMS and NLMS algorithms
are identical. For $\kappa$ close to 1, the IPNLMS behaves like the PNLMS algorithm
[26]. In practice, a good choice for $\kappa$ is 0 or $-0.5$.
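
The role of $\kappa$ can be checked with a small sketch of one IPNLMS update acting on the stacked filter of length $PL$. The function names, default values and random test vectors are assumptions for illustration. With $\kappa = -1$ the gains are uniform, and the update coincides with an NLMS step whose regularization is $PL$ times larger.

```python
import numpy as np

def ipnlms_step(h, x_vec, y, alpha=0.5, kappa=0.0, delta=1e-4, eps=1e-8):
    """One IPNLMS update of the stacked filter h (length PL)."""
    e = y - h @ x_vec                                        # a priori error
    g = ((1 - kappa) / (2 * h.size)
         + (1 + kappa) * np.abs(h) / (2 * np.sum(np.abs(h)) + eps))
    mu = alpha / (np.sum(x_vec ** 2 * g) + delta)
    return h + mu * g * x_vec * e

def nlms_step(h, x_vec, y, alpha=0.5, delta=1e-4):
    """Plain regularized NLMS update, used only for comparison."""
    e = y - h @ x_vec
    return h + alpha * x_vec * e / (x_vec @ x_vec + delta)

rng = np.random.default_rng(4)
h = rng.standard_normal(8)                                   # current estimate (toy)
x_vec = rng.standard_normal(8)                               # stacked regressor (toy)
y = 0.7                                                      # reference sample (toy)

h_ipnlms = ipnlms_step(h, x_vec, y, kappa=-1.0, delta=1e-4)
h_nlms = nlms_step(h, x_vec, y, delta=h.size * 1e-4)
print(np.max(np.abs(h_ipnlms - h_nlms)))
```

For $-1 < \kappa < 1$ the gains still sum to approximately one, so the overall step size stays comparable to NLMS while larger coefficients receive larger individual updates.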

We can show that the IPNLMS and EG± algorithms are related [28]. The
IPNLMS is in fact an approximation of the EG± if we approximate in the latter
$\exp(a) \approx 1 + a$ for $|a| \ll 1$.
8. THE MULTICHANNEL FREQUENCY-DOMAIN
ADAPTIVE ALGORITHM

Adaptive algorithms in the frequency domain are, in general, extremely effi- 
cient since they use the fast Fourier transform (FFT) as an intermediary step. As 
a result, they are now implemented in many prototypes and products for acous- 
tic echo cancellation. In this section, we briefly explain how these algorithms 
can be derived rigorously from a block error signal. 

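From now on and for simplification, we drop the parameter $q$ in all equations.
With this simplification, the error signal at time $n$ is now:

\[ e(n) = y(n) - \sum_{p=1}^{P}\hat{\mathbf{h}}_p^T\mathbf{x}_p(n), \tag{5.85} \]

where $\hat{\mathbf{h}}_p$ is the estimated impulse response of the $p$th channel,

\[ \hat{\mathbf{h}}_p = \left[ \hat{h}_{p,0} \;\; \hat{h}_{p,1} \;\; \cdots \;\; \hat{h}_{p,L-1} \right]^T. \]

We now define the block error signal (of length $N \leq L$). For that, we assume
that $L$ is an integer multiple of $N$, i.e., $L = KN$. We have:

\[ \mathbf{e}(m) = \mathbf{y}(m) - \hat{\mathbf{y}}(m) = \mathbf{y}(m) - \sum_{p=1}^{P}\mathbf{X}_p^T(m)\hat{\mathbf{h}}_p, \tag{5.86} \]

where $m$ is the block time index, and

\[ \begin{aligned}
\mathbf{e}(m) &= \left[ e(mN) \;\; \cdots \;\; e(mN+N-1) \right]^T, \\
\mathbf{y}(m) &= \left[ y(mN) \;\; \cdots \;\; y(mN+N-1) \right]^T, \\
\mathbf{X}_p(m) &= \left[ \mathbf{x}_p(mN) \;\; \cdots \;\; \mathbf{x}_p(mN+N-1) \right], \\
\hat{\mathbf{y}}(m) &= \left[ \hat{y}(mN) \;\; \cdots \;\; \hat{y}(mN+N-1) \right]^T.
\end{aligned} \]

It can easily be checked that $\mathbf{X}_p$ is a Toeplitz matrix of size $(L \times N)$.

We can show that for $K = L/N$, we can write

\[ \mathbf{X}_p^T(m)\hat{\mathbf{h}}_p = \sum_{k=0}^{K-1}\mathbf{T}_p(m-k)\hat{\mathbf{h}}_{p,k}, \tag{5.87} \]

where $\mathbf{T}_p(m-k)$ is an $(N \times N)$ Toeplitz matrix whose $(i,j)$th element is
$x_p((m-k)N + i - j)$, $i, j = 0, 1, \ldots, N-1$, and

\[ \hat{\mathbf{h}}_{p,k} = \left[ \hat{h}_{p,kN} \;\; \hat{h}_{p,kN+1} \;\; \cdots \;\; \hat{h}_{p,kN+N-1} \right]^T, \quad k = 0, 1, \ldots, K-1, \]

are the sub-filters of $\hat{\mathbf{h}}_p$. In (5.87), the filter $\hat{\mathbf{h}}_p$ (of length $L$) is partitioned into
$K$ sub-filters $\hat{\mathbf{h}}_{p,k}$ of length $N$ and the rectangular matrix $\mathbf{X}_p^T$ [of size $(N \times L)$]
is decomposed to $K$ square sub-matrices of size $(N \times N)$.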

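It is well known that a Toeplitz matrix $\mathbf{T}_p$ can be transformed, by doubling
its size, to a circulant matrix $\mathbf{C}_p$. Also, a circulant matrix is easily decomposed
as follows: $\mathbf{C}_p = \mathbf{F}_{2N\times 2N}^{-1}\mathbf{D}_p\mathbf{F}_{2N\times 2N}$, where $\mathbf{F}_{2N\times 2N}$ is the Fourier matrix [of
size $(2N \times 2N)$] and $\mathbf{D}_p$ is a diagonal matrix whose elements are the discrete
Fourier transform of the first column of $\mathbf{C}_p$. If we multiply (5.86) by $\mathbf{F}_{N\times N}$
[Fourier matrix of size $(N \times N)$], we get the error signal in the frequency
domain (denoted by underbars):

\[ \begin{aligned}
\underline{\mathbf{e}}(m) &= \underline{\mathbf{y}}(m) - \mathbf{G}_{N\times 2N}^{01}\sum_{p=1}^{P}\sum_{k=0}^{K-1}\mathbf{D}_p(m-k)\mathbf{G}_{2N\times N}^{10}\underline{\hat{\mathbf{h}}}_{p,k} \\
&= \underline{\mathbf{y}}(m) - \mathbf{G}_{N\times 2N}^{01}\sum_{p=1}^{P}\sum_{k=0}^{K-1}\mathbf{U}_p(m-k)\underline{\hat{\mathbf{h}}}_{p,k} \\
&= \underline{\mathbf{y}}(m) - \mathbf{G}_{N\times 2N}^{01}\sum_{p=1}^{P}\boldsymbol{\mathcal{U}}_p(m)\underline{\hat{\mathbf{h}}}_p \\
&= \underline{\mathbf{y}}(m) - \mathbf{G}_{N\times 2N}^{01}\boldsymbol{\mathcal{U}}(m)\underline{\hat{\mathbf{h}}},
\end{aligned} \tag{5.88} \]

where

\[ \begin{aligned}
\underline{\mathbf{e}}(m) &= \mathbf{F}_{N\times N}\mathbf{e}(m), \\
\underline{\mathbf{y}}(m) &= \mathbf{F}_{N\times N}\mathbf{y}(m), \\
\mathbf{G}_{N\times 2N}^{01} &= \mathbf{F}_{N\times N}\mathbf{W}_{N\times 2N}^{01}\mathbf{F}_{2N\times 2N}^{-1}, \\
\mathbf{W}_{N\times 2N}^{01} &= \left[ \mathbf{0}_{N\times N} \;\; \mathbf{I}_{N\times N} \right], \\
\mathbf{G}_{2N\times N}^{10} &= \mathbf{F}_{2N\times 2N}\mathbf{W}_{2N\times N}^{10}\mathbf{F}_{N\times N}^{-1}, \\
\mathbf{W}_{2N\times N}^{10} &= \left[ \mathbf{I}_{N\times N} \;\; \mathbf{0}_{N\times N} \right]^T, \\
\underline{\hat{\mathbf{h}}}_{p,k} &= \mathbf{F}_{N\times N}\hat{\mathbf{h}}_{p,k}, \\
\underline{\hat{\mathbf{h}}}_p &= \left[ \underline{\hat{\mathbf{h}}}_{p,0}^T \;\; \underline{\hat{\mathbf{h}}}_{p,1}^T \;\; \cdots \;\; \underline{\hat{\mathbf{h}}}_{p,K-1}^T \right]^T, \\
\underline{\hat{\mathbf{h}}} &= \left[ \underline{\hat{\mathbf{h}}}_1^T \;\; \underline{\hat{\mathbf{h}}}_2^T \;\; \cdots \;\; \underline{\hat{\mathbf{h}}}_P^T \right]^T, \\
\mathbf{D}_p(m-k) &= \mathbf{F}_{2N\times 2N}\mathbf{C}_p(m-k)\mathbf{F}_{2N\times 2N}^{-1}, \\
\mathbf{U}_p(m-k) &= \mathbf{D}_p(m-k)\mathbf{G}_{2N\times N}^{10}, \\
\boldsymbol{\mathcal{U}}_p(m) &= \left[ \mathbf{U}_p(m) \;\; \mathbf{U}_p(m-1) \;\; \cdots \;\; \mathbf{U}_p(m-K+1) \right], \\
\boldsymbol{\mathcal{U}}(m) &= \left[ \boldsymbol{\mathcal{U}}_1(m) \;\; \boldsymbol{\mathcal{U}}_2(m) \;\; \cdots \;\; \boldsymbol{\mathcal{U}}_P(m) \right], \\
\boldsymbol{\mathcal{D}}_p(m) &= \left[ \mathbf{D}_p(m) \;\; \mathbf{D}_p(m-1) \;\; \cdots \;\; \mathbf{D}_p(m-K+1) \right], \\
\boldsymbol{\mathcal{D}}(m) &= \left[ \boldsymbol{\mathcal{D}}_1(m) \;\; \boldsymbol{\mathcal{D}}_2(m) \;\; \cdots \;\; \boldsymbol{\mathcal{D}}_P(m) \right], \\
\mathbf{G}_{2PL\times PL}^{10} &= \mathrm{diag}\left[ \mathbf{G}_{2N\times N}^{10} \;\; \cdots \;\; \mathbf{G}_{2N\times N}^{10} \right], \\
\boldsymbol{\mathcal{U}}(m) &= \boldsymbol{\mathcal{D}}(m)\mathbf{G}_{2PL\times PL}^{10}.
\end{aligned} \]

The size of the matrix $\boldsymbol{\mathcal{U}}$ is $(2N \times PL)$ and the length of $\underline{\hat{\mathbf{h}}}$ is $PL$.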
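
The circulant embedding behind (5.87) and (5.88) can be illustrated numerically for one channel. The sketch below is an assumption-laden toy, not the algorithm of this section: the helper names are hypothetical, the signal and filter are random, and NumPy's FFT stands in for the Fourier matrices. It forms each $N \times N$ Toeplitz product through a $2N$-point FFT (zero-padding the sub-filter, then keeping the last $N$ outputs) and compares the partitioned sum with a direct convolution.

```python
import numpy as np

def toeplitz_block(x, start, N):
    """N x N Toeplitz matrix with entries x[start + i - j]."""
    i, j = np.indices((N, N))
    return x[start + i - j]

def toeplitz_times_vector_fft(x, start, N, h_sub):
    """Same product via the 2N x 2N circulant embedding and the FFT."""
    c = x[start - N:start + N]                   # first column of the circulant
    d = np.fft.fft(c)                            # diagonal of D
    v = np.fft.fft(np.concatenate([h_sub, np.zeros(N)]))   # zero-pad the sub-filter
    return np.real(np.fft.ifft(d * v))[N:]       # keep the last N outputs

rng = np.random.default_rng(5)
N, K, m = 4, 3, 5                                # block length, partitions, block index
x = rng.standard_normal(64)                      # toy input signal
h = rng.standard_normal(K * N)                   # toy filter of length L = KN

direct = np.convolve(x, h)[m * N:m * N + N]      # filter output for block m
partitioned = sum(
    toeplitz_times_vector_fft(x, (m - k) * N, N, h[k * N:(k + 1) * N])
    for k in range(K)
)
print(np.max(np.abs(direct - partitioned)))
```

The zero-padding step plays the role of $\mathbf{W}_{2N\times N}^{10}$ and the selection of the last $N$ samples plays the role of $\mathbf{W}_{N\times 2N}^{01}$; the element-wise product in the FFT domain corresponds to the diagonal matrix $\mathbf{D}_p(m-k)$.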

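By minimizing the criterion

\[ J_f(m) = (1-\lambda)\sum_{i=0}^{m}\lambda^{m-i}\,\underline{\mathbf{e}}^H(i)\,\underline{\mathbf{e}}(i), \tag{5.89} \]

where $^H$ denotes conjugate transpose and $\lambda$ ($0 < \lambda < 1$) is an exponential
forgetting factor, we obtain the normal equations for the multichannel case:

\[ \mathbf{S}(m)\underline{\hat{\mathbf{h}}}(m) = \mathbf{s}(m), \tag{5.90} \]

where

\[ \mathbf{S}(m) = \lambda\mathbf{S}(m-1) + (1-\lambda)\left( \mathbf{G}_{2PL\times PL}^{10} \right)^H\boldsymbol{\mathcal{D}}^H(m)\,\mathbf{G}_{2N\times 2N}^{01}\,\boldsymbol{\mathcal{D}}(m)\,\mathbf{G}_{2PL\times PL}^{10} \tag{5.91} \]

is a $(PL \times PL)$ matrix,

\[ \mathbf{G}_{2N\times 2N}^{01} = \left( \mathbf{G}_{N\times 2N}^{01} \right)^H\mathbf{G}_{N\times 2N}^{01} = \mathbf{F}_{2N\times 2N}\mathbf{W}_{2N\times 2N}^{01}\mathbf{F}_{2N\times 2N}^{-1}, \]

and

\[ \mathbf{s}(m) = \lambda\mathbf{s}(m-1) + (1-\lambda)\left( \mathbf{G}_{2PL\times PL}^{10} \right)^H\boldsymbol{\mathcal{D}}^H(m)\,\underline{\mathbf{y}}_{2N}(m) \]

is a $(PL \times 1)$ vector. Assuming that the $P$ input signals are not perfectly
pairwise coherent, the normal equations have a unique solution which is the
optimal Wiener solution.

Define the following variables:

\[ \begin{aligned}
\underline{\mathbf{y}}_{2N}(m) &= \left( \mathbf{G}_{N\times 2N}^{01} \right)^H\underline{\mathbf{y}}(m) = \mathbf{F}_{2N\times 2N}\begin{bmatrix} \mathbf{0}_{N\times 1} \\ \mathbf{y}(m) \end{bmatrix}, \\
\underline{\mathbf{e}}_{2N}(m) &= \left( \mathbf{G}_{N\times 2N}^{01} \right)^H\underline{\mathbf{e}}(m) = \mathbf{F}_{2N\times 2N}\begin{bmatrix} \mathbf{0}_{N\times 1} \\ \mathbf{e}(m) \end{bmatrix}, \\
\mathbf{G}_{2PL\times 2PL}^{10} &= \mathrm{diag}\left[ \mathbf{G}_{2N\times 2N}^{10} \;\; \cdots \;\; \mathbf{G}_{2N\times 2N}^{10} \right], \\
\underline{\hat{\mathbf{h}}}_{2PL}(m) &= \mathbf{G}_{2PL\times PL}^{10}\,\underline{\hat{\mathbf{h}}}(m).
\end{aligned} \]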
We can show that from the normal equations, we can exactly derive the following
multichannel frequency-domain adaptive algorithm:

\[ \mathbf{Q}(m) = \lambda\mathbf{Q}(m-1) + (1-\lambda)\,\boldsymbol{\mathcal{D}}^H(m)\,\mathbf{G}_{2N\times 2N}^{01}\,\boldsymbol{\mathcal{D}}(m), \tag{5.93} \]

\[ \underline{\mathbf{e}}_{2N}(m) = \underline{\mathbf{y}}_{2N}(m) - \mathbf{G}_{2N\times 2N}^{01}\,\boldsymbol{\mathcal{D}}(m)\,\underline{\hat{\mathbf{h}}}_{2PL}(m-1), \tag{5.94} \]

\[ \underline{\hat{\mathbf{h}}}_{2PL}(m) = \underline{\hat{\mathbf{h}}}_{2PL}(m-1) + (1-\lambda)\,\mathbf{G}_{2PL\times 2PL}^{10}\,\mathbf{Q}^{-1}(m)\,\boldsymbol{\mathcal{D}}^H(m)\,\underline{\mathbf{e}}_{2N}(m). \tag{5.95} \]

Depending on the approximations we make on matrix $\mathbf{Q}(m)$, we obtain different
algorithms. There is a compromise between performance (convergence rate)
and complexity. Several algorithms can be deduced from the previous form;
some of them are well known while others are new. See [29] for a general
derivation of adaptive algorithms in the frequency domain. See also [30] and
[31] for new algorithms derived directly from the above equations.

9. CONCLUSIONS 

In this chapter, we have given an overview on multichannel adaptive algo- 
rithms in the context of multichannel acoustic echo cancellation. We have first 
derived the normal equations of a MIMO system and discussed the identifi- 
cation problem. We have shown that, in the multichannel case and when the 
input signals are linearly related, there is a nonuniqueness problem that does not 
exist in the single-input single-output (SISO) case. In order to have a unique 
solution, we have to decorrelate somehow the input signals without affecting 
the spatialization effect and the quality of the signals. We have then derived 
many useful adaptive algorithms in a context where the strong correlation be- 
tween the input signals affects seriously the performance of several well-known 
algorithms. 
Abstract Double-talk detectors (DTDs) are vital to the operation and performance of
acoustic echo cancelers. In this chapter, we highlight important aspects needed to be
considered when choosing and designing a DTD. The generic double-talk
detector scheme along with fundamental means of performance evaluation are
discussed. A number of double-talk detectors suitable for acoustic echo cancelers
are presented and objectively compared.

Keywords: Double-Talk Detector, Acoustic Echo Canceler, Adaptive Algorithms, LMS,
NLMS, Coherence, Cross-Correlation, Robust Statistics



1. INTRODUCTION 

The design of a good double-talk detector is much more of an art than the
design of the adaptive filter itself. [1]

Ideally, acoustic echo cancelers (AECs) remove undesired echoes that result
from coupling between the loudspeaker and the microphone used in full-duplex
hands-free telecommunication systems. Figure 6.1 shows a basic AEC block
diagram. The far-end speech signal $x(n)$ goes through the echo path represented
by a filter $h(n)$, then it is picked up by the microphone together with the
near-end talker signal $v(n)$ and ambient noise $w(n)$. The microphone signal is
denoted $y(n)$. Most often the echo path is modeled by an adaptive FIR filter,
$\hat{h}(n)$, that generates a replica of the echo. This echo estimate is then subtracted
from the return channel and thereby cancellation is achieved. This may look
like a simple straightforward system identification task for the adaptive filter.
However, in most conversations there are so-called double-talk situations that
make the identification much more problematic than it might appear at
first glance. Double-talk occurs when the speech of the two talkers arrives
simultaneously at the echo canceler, i.e. $x(n) \neq 0$ and $v(n) \neq 0$ (the situation
with near-end talk only, $x(n) = 0$ and $v(n) \neq 0$, can be regarded as an
“easy-to-detect” double-talk case). In the double-talk situation, the near-end speech acts
as a high-level uncorrelated noise to the adaptive algorithm. The disturbing
near-end speech may cause the adaptive filter to diverge. Hence, annoying
audible echo will pass through to the far-end. The usual way to alleviate this
problem is to slow down or completely halt the filter adaptation when presence
of near-end speech is detected. This is the very important role of the so-called
double-talk detector (DTD). The basic double-talk detection scheme is based
on computing a detection statistic, and comparing it with a preset threshold,
$T$. The important issues that have to be addressed when designing a DTD are:

Figure 6.1 Block diagram of a basic AEC setup.

(i) What basic knowledge is needed in order to devise a sufficient DTD
solution?

(ii) What characterizes “good” double-talk detectors?

Primarily, we must know under what circumstances double-talk disturbs the
adaptive filter. The performance of AECs is most often evaluated through
their mean-square error (MSE) performance or, preferably, through their
misalignment $\varepsilon = \|\mathbf{h} - \hat{\mathbf{h}}\|_2/\|\mathbf{h}\|_2$ in different situations. The misalignment formula
reveals the answer to question (i). This formula is given by

\[ E\{\varepsilon^2\} \approx \frac{\mu_0}{2\,\mathrm{EBR}}, \tag{6.1} \]

where $E\{\cdot\}$ denotes mathematical expectation and $\mu_0$ is a constant parameter of
the adaptive algorithm. That is, the level of convergence (misalignment
performance) is completely governed by the echo-to-background (near-end speech and
noise) power ratio, EBR. If we have equally strong talkers during double-talk,
we find that when the echo path has a high attenuation, the adaptive filter will be
very sensitive to double-talk. On the contrary, a low attenuation or amplification
of the echo path means lower sensitivity. This is important to remember when
choosing the parameters of the DTD. Furthermore, it naturally shows why we
should lower the step-size parameter $\mu_0$ in high noise or double-talk conditions
in order to maintain a certain performance. Equation (6.1) is easily found by
assuming the far- and near-end signals are uncorrelated stochastic processes and
the adaptive algorithm is NLMS [3]. Furthermore, if we suppose that the
signals are white noises, the misalignment is exactly the inverse of the echo return
loss enhancement (ERLE), which reflects the echo attenuation provided by the
AEC. Even though these assumptions describe an oversimplified model of the
AEC situation, they give valuable insight into what governs divergence
of the adaptive algorithm.

Issue (ii) can be partially addressed by characterizing DTDs through the use 
of general detection theory [4, 5, 6]. By means of detection probability and 
false alarm rates we can objectively evaluate and compare the performance of 
different DTDs. Moreover, the theory justifies a method to select the threshold 
T which has been missing in the field of DTD design. One must, however, also 
accompany these performance measures with a joint evaluation of the DTD and 
the echo canceler. 

A large number of DTD schemes have been proposed since the introduction of 
echo cancelers [7]. The Geigel algorithm [8] has proven successful in network 
echo cancelers; however, it does not always provide reliable performance when 
used in the acoustic situation. This is because it assumes a minimum echo 
path attenuation which may not be valid in the acoustic case. Other methods 
based on cross-correlation and coherence [9, 10, 4, 5] have been studied which
appear to be more appropriate for the acoustic application. Spectral comparing 
methods [11] and two-microphone solutions have also been proposed [12]. A 
DTD based on multi statistic testing in combination with modeling of the echo 
path by two filters is proposed in [13]. The objective of this chapter is to 
summarize some of the DTD proposals and present evaluation methods. Many 
results in this chapter are derived from the papers [5, 6]. 

The chapter is organized as follows: Section 2 introduces the AEC notations 
and describes the general DTD scheme. A number of double-talk detection 
algorithms that have been proposed and used in acoustic echo cancelers are 
presented in Section 3. Section 4 compares a selected number of DTDs by
means of their receiver operating characteristics. Section 5 gives a discussion
of the abilities of different DTD schemes and summarizes the important aspects 
that need to be considered for a successful double-talk detector implementation. 

2. BASICS OF AEC AND DTD 

In this section, we give the basics of an AEC combined with a DTD. We first 
formulate the AEC problem. 

2.1 AEC NOTATIONS 

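The AEC setup in Fig. 6.1 is described in mathematical terms as:

\[ y(n) = \mathbf{h}^T\mathbf{x}(n) + v(n) + w(n), \tag{6.2} \]

where

\[ \mathbf{h} = \left[ h_0 \;\; h_1 \;\; \cdots \;\; h_{N-1} \right]^T, \]

\[ \mathbf{x}(n) = \left[ x(n) \;\; x(n-1) \;\; \cdots \;\; x(n-N+1) \right]^T, \]

and $N$ is the length of the echo path response, $\mathbf{h}$. The error signal is defined as

\[ e(n) = y(n) - \left[ \hat{\mathbf{h}}^T \;\; \mathbf{0}^T \right]\mathbf{x}(n) = \Delta\mathbf{h}^T\mathbf{x}(n) + v(n) + w(n), \tag{6.3} \]

where

\[ \hat{\mathbf{h}} = \left[ \hat{h}_0 \;\; \hat{h}_1 \;\; \cdots \;\; \hat{h}_{L-1} \right]^T \tag{6.4} \]

is the adaptive filter coefficient vector of length $L$ (generally less than $N$), and

\[ \Delta\mathbf{h} = \mathbf{h} - \begin{bmatrix} \hat{\mathbf{h}} \\ \mathbf{0} \end{bmatrix}. \tag{6.5} \]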


2.2 THE GENERIC DTD 

Double-talk detectors basically operate in the same manner. Thus, the general
procedure for handling double-talk is described by the following:

1. A detection statistic $\xi$ is formed using available signals, e.g. $x$, $y$, $e$, etc.,
and the estimated filter coefficients $\hat{\mathbf{h}}$.

2. The detection statistic $\xi$ is compared to a preset threshold $T$, and
double-talk is declared if $\xi < T$.

3. Once double-talk is declared, the detection is held for a minimum period
of time $T_{\mathrm{hold}}$. While the detection is held, the filter adaptation is disabled.

4. If $\xi \geq T$ consecutively over a time $T_{\mathrm{hold}}$, the filter resumes adaptation,
while the comparison of $\xi$ to $T$ continues until $\xi < T$ again.

The hold time $T_{\mathrm{hold}}$ in Step 3 and Step 4 is necessary to suppress detection
dropouts due to the noisy behavior of the detection statistic. Although there are
some possible variations, most of the DTD algorithms keep this basic form and
only differ in how to form the detection statistic.
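
The four steps above can be sketched as a small state machine. This is only one possible reading of the procedure: the function name is hypothetical, the hold time is counted in samples, and the detection sample itself is not counted as part of the hold period.

```python
def dtd_adaptation_mask(xi_seq, threshold, hold):
    """Return one flag per sample: True where filter adaptation is enabled."""
    mask = []
    hold_left = 0                       # remaining hold samples
    for xi in xi_seq:
        if xi < threshold:              # Step 2: double-talk declared
            hold_left = hold            # Step 3: (re)start the hold period
            mask.append(False)
        elif hold_left > 0:             # still holding, adaptation disabled
            hold_left -= 1
            mask.append(False)
        else:                           # Step 4: statistic stayed above T long enough
            mask.append(True)
    return mask

print(dtd_adaptation_mask([1, 1, 0, 1, 1, 1, 1], threshold=0.5, hold=2))
# prints [True, True, False, False, False, True, True]
```

A new detection during the hold period restarts the counter, so short dropouts of the statistic do not re-enable adaptation in the middle of a double-talk burst.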

An “optimum” decision variable $\xi$ for double-talk detection will behave as
follows:

(i) if $v(n) = 0$ (double-talk is not present), $\xi \geq T$.

(ii) if $v(n) \neq 0$ (double-talk is present), $\xi < T$.

(iii) $\xi$ is insensitive to echo path variations.

The threshold $T$ must be a constant, independent of the data. Moreover, it is
desirable that the decisions are made without introducing any delay (or with
minimal introduced delay) in the updating of the adaptive filter. The delayed
decisions will otherwise affect the AEC algorithm negatively.

2.3 A SUGGESTION TO PERFORMANCE 
EVALUATION OF DTDS 

The role of the threshold T is essential to the performance of the double-talk 
detector. To select the value of T and to compare different DTDs objectively 
one could view the DTD as a classical binary detection problem. By doing so, it 
is possible to rely on established detection theory. This approach to characterize 
DTDs was proposed in [4, 6]. 

The general characteristics of a binary detection scheme are:

■ Probability of False Alarm ($P_f$): Probability of declaring detection when
a target, in our case double-talk, is not present.

■ Probability of Detection ($P_d$): Probability of successful detection when
a target is present.

■ Probability of Miss ($P_m = 1 - P_d$): Probability of detection failure when
a target is present.

A well designed DTD maximizes $P_d$ while minimizing $P_f$ even in a low SNR.
In general, higher $P_d$ is achieved at the cost of higher $P_f$. There should be a
tradeoff in performance depending on the penalty or cost function of a false
alarm.

One common approach to characterize different detection methods is to
represent the detection characteristic $P_d$ (or $P_m$) as a function of false alarm
probability $P_f$ under a given constraint on the SNR. This is known as a receiver
operating characteristic (ROC). The $P_f$ constraint can be interpreted as the
maximum tolerable false alarm rate.

Evaluation of a DTD is carried out by estimating the performance parameters,
$P_d$ ($P_m$) and $P_f$. A principle for this technique can be found in [6]. In the
end, however, one should accompany these performance measures with a joint
evaluation of the DTD and the AEC. This is due to the fact that the response
time of the DTD can seriously affect the performance of the AEC and this is in
general not shown in the ROC curve.
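
As a minimal sketch of how one ROC point could be estimated from labeled data, assume a recorded sequence of detection statistics together with ground-truth double-talk flags. Both the function name and the toy data are hypothetical, and this simple counting estimate is not the evaluation procedure of [6].

```python
def roc_point(xi_seq, double_talk, threshold):
    """Estimate (P_d, P_f) for the rule 'declare double-talk when xi < threshold'."""
    with_dt = [xi for xi, dt in zip(xi_seq, double_talk) if dt]
    without_dt = [xi for xi, dt in zip(xi_seq, double_talk) if not dt]
    p_d = sum(xi < threshold for xi in with_dt) / len(with_dt)
    p_f = sum(xi < threshold for xi in without_dt) / len(without_dt)
    return p_d, p_f

xi_seq = [0.9, 0.8, 0.4, 0.2, 0.3, 0.95]                 # toy statistics
double_talk = [False, False, False, True, True, True]    # toy ground truth
p_d, p_f = roc_point(xi_seq, double_talk, threshold=0.5)
p_m = 1.0 - p_d
print(p_d, p_f, p_m)
```

Sweeping the threshold and plotting $P_d$ (or $P_m$) against $P_f$ gives an empirical ROC curve; the threshold $T$ can then be read off at the maximum tolerable false alarm rate.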

3. DOUBLE-TALK DETECTION ALGORITHMS 

In this section, we explain different DTD algorithms that can he useful for 
AEC. We start with the Geigel algorithm since it was the very first DTD pro- 
posal. 

3.1 THE GEIGEL ALGORITHM 

A very simple algorithm due to A. A. Geigel [8] is to declare the presence
of near-end speech whenever

\[ \xi^{(g)} = \frac{\max\{|x(n)|, \ldots, |x(n-L_g+1)|\}}{|y(n)|} < T, \]

where $L_g$ and $T$ are suitably chosen constants. This detection scheme is
based on a waveform level comparison between the microphone signal $y(n)$ and
the far-end speech $x(n)$, assuming the near-end speech $v(n)$ in the
microphone signal will typically be stronger than the echo
$\mathbf{h}^T\mathbf{x}$. The maximum, or $\ell_\infty$ norm, of the $L_g$
most recent samples of $x(n)$ is taken for the comparison because of the
undetermined delay in the echo path. The threshold $T$ compensates for the
energy level of the echo path response $\mathbf{h}$, and is often set to 2 for
network echo cancelers because the hybrid loss is typically about 6 dB or
more. For an AEC, however, it is not clear how to set a universal threshold
that works reliably in all situations, because the loss through the acoustic
echo path can vary greatly depending on many factors. For $L_g$, one choice is
to set it equal to the adaptive filter length $L$, since we can assume that
the echo path is covered by this length.
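
The Geigel rule can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function name `geigel_dtd` and the
per-sample list interface are hypothetical, no hold (hangover) time is
applied, and the comparison is written as $\max|x| < T\,|y(n)|$ so that a zero
microphone sample does not cause a division by zero.

```python
from collections import deque


def geigel_dtd(x, y, L_g, T):
    """Return one boolean per sample: True where near-end speech is declared.

    x, y : sequences of far-end and microphone samples (same length)
    L_g  : number of recent far-end samples inspected
    T    : detection threshold (e.g. 2 for a 6 dB hybrid loss)
    """
    recent = deque(maxlen=L_g)  # holds |x(n)|, ..., |x(n - L_g + 1)|
    decisions = []
    for xn, yn in zip(x, y):
        recent.append(abs(xn))
        # xi^(g) < T  <=>  max|x| < T * |y(n)|  (no division needed)
        decisions.append(max(recent) < T * abs(yn))
    return decisions
```

A larger `L_g` tolerates longer echo-path delays, at the cost of comparing
the microphone level against older far-end peaks.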

3.2 THE CROSS-CORRELATION METHOD 

In [9] the cross-correlation coefficient vector between x and e was proposed 
as a means for double-talk detection. A similar idea using the cross-correlation 
coefficient vector between x and y has proven more robust and reliable [10, 6]. 
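This section will therefore focus on the cross-correlation coefficient vector
between $\mathbf{x}$ and $y$, which is defined as

\[ \mathbf{c}^{(1)}_{xy} = \frac{E\{\mathbf{x}(n)y(n)\}}{\sqrt{E\{x^2(n)\}E\{y^2(n)\}}}
   = \frac{\mathbf{r}_{xy}}{\sigma_x\sigma_y}
   = \left[\, c^{(1)}_{xy,0} \;\; c^{(1)}_{xy,1} \;\; \cdots \;\; c^{(1)}_{xy,L-1} \,\right]^T, \tag{6.7} \]

where $c^{(1)}_{xy,i}$ is the cross-correlation coefficient between $x(n-i)$
and $y(n)$.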
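The idea here is to compare

\[ \xi^{(1)} = \|\mathbf{c}^{(1)}_{xy}\|_\infty = \max_i |c^{(1)}_{xy,i}|, \quad i = 0, 1, \ldots, L-1, \tag{6.8} \]

to a threshold level $T$. The decision rule is very simple: if
$\xi^{(1)} \geq T$, then double-talk is not present; if $\xi^{(1)} < T$, then
double-talk is present.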

Although the $\ell_\infty$ norm used in (6.8) is perhaps the most natural,
other scalar metrics, e.g., $\ell_1$ or $\ell_2$, could alternatively be used
to assess the cross-correlation coefficient vectors. However, there is a
fundamental problem here which is not linked to the type of metric used. The
problem is that these cross-correlation coefficient vectors are not well
normalized. Indeed, we can only say in general that $\xi^{(1)} \leq 1$. If
$v(n) = 0$, that does not imply that $\xi^{(1)} = 1$ or any other known value.
We do not know the value of $\xi^{(1)}$ in general. The amount of correlation
will depend a great deal on the statistics of the signals and of the echo
path. As a result, the best value of $T$ will vary a lot from one experiment
to another. So there is no natural threshold level associated with the
variable $\xi^{(1)}$ when $v(n) = 0$.

The next section presents a decision variable that exhibits better properties
than the cross-correlation statistic. This decision variable is formed by
properly normalizing the cross-correlation vector between $\mathbf{x}$ and
$y$.



3.3 THE NORMALIZED CROSS-CORRELATION METHOD

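There is a simple way to normalize the cross-correlation vector between a
vector $\mathbf{x}$ and a scalar $y$ in order to have a natural threshold
level for $\xi$ when $v(n) = 0$.

Suppose that $v(n) = 0$. In this case:

\[ \sigma_y^2 = \mathbf{h}^T\mathbf{R}_{xx}\mathbf{h}, \tag{6.9} \]

where $\mathbf{R}_{xx} = E\{\mathbf{x}(n)\mathbf{x}^T(n)\}$. Since
$y(n) = \mathbf{h}^T\mathbf{x}(n)$, we have

\[ \mathbf{r}_{xy} = \mathbf{R}_{xx}\mathbf{h}, \tag{6.10} \]

and (6.9) can be re-written as

\[ \sigma_y^2 = \mathbf{r}_{xy}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}. \tag{6.11} \]

In general, for $v(n) \neq 0$ we have

\[ \sigma_y^2 = \mathbf{r}_{xy}^T\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy} + \sigma_v^2. \tag{6.12} \]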

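If we divide (6.11) by (6.12) and compute its square root, we obtain the
decision variable [5, 14]

\[ \xi^{(2)} = \sqrt{\mathbf{r}_{xy}^T(\sigma_y^2\mathbf{R}_{xx})^{-1}\mathbf{r}_{xy}}
   = \|\mathbf{c}^{(2)}_{xy}\|_2, \tag{6.13} \]

where

\[ \mathbf{c}^{(2)}_{xy} = (\sigma_y^2\mathbf{R}_{xx})^{-1/2}\mathbf{r}_{xy} \tag{6.14} \]

is what we will call the normalized cross-correlation vector between
$\mathbf{x}$ and $y$.

Substituting (6.10) and (6.12) into (6.13), we show that the decision variable
is:

\[ \xi^{(2)} = \frac{\sqrt{\mathbf{h}^T\mathbf{R}_{xx}\mathbf{h}}}
                     {\sqrt{\mathbf{h}^T\mathbf{R}_{xx}\mathbf{h} + \sigma_v^2}}. \tag{6.15} \]

We easily deduce from (6.15) that for $v(n) = 0$, $\xi^{(2)} = 1$ and for
$v(n) \neq 0$, $\xi^{(2)} < 1$. Note also that $\xi^{(2)}$ is not sensitive to
changes of the echo path when $v = 0$.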

For the particular case when $x$ is white Gaussian noise, the autocorrelation
matrix is diagonal: $\mathbf{R}_{xx} = \sigma_x^2\mathbf{I}$. Then (6.14)
becomes:

\[ \mathbf{c}^{(2)}_{xy} = \frac{\mathbf{r}_{xy}}{\sigma_x\sigma_y} = \mathbf{c}^{(1)}_{xy}. \tag{6.16} \]

Hence, in general what we are doing in (6.13) is equivalent to prewhitening
the signal $x$, which is one of many known "generalized cross-correlation"
techniques [15]. Thus, when $x$ is white, no prewhitening is necessary and
$\mathbf{c}^{(2)}_{xy} = \mathbf{c}^{(1)}_{xy}$. This suggests a more
practical implementation, whereby matrix operations are replaced by an
adaptive prewhitening filter [16].
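
To make the normalization argument concrete, the following Python sketch
evaluates both statistics in closed form for a white far-end signal. It rests
on assumptions not stated in the text: exact (not estimated) statistics,
$\mathbf{R}_{xx} = \sigma_x^2\mathbf{I}$, and hypothetical function names. For
$v(n) = 0$ the cross-correlation statistic then reduces to
$\max_i |h_i| / \|\mathbf{h}\|_2$, which depends on the shape of the echo
path, whereas (6.15) always gives 1.

```python
import math


def xi1_white(h):
    """xi^(1) for white x and v(n) = 0: max_i |h_i| / ||h||_2."""
    return max(abs(c) for c in h) / math.sqrt(sum(c * c for c in h))


def xi2_white(h, sigma_x2, sigma_v2):
    """xi^(2) from (6.15) with R_xx = sigma_x2 * I."""
    echo_power = sigma_x2 * sum(c * c for c in h)  # h^T R_xx h
    return math.sqrt(echo_power / (echo_power + sigma_v2))


flat_path = [0.5, 0.5, 0.5, 0.5]   # energy spread over four taps
peaky_path = [1.0, 0.0, 0.0, 0.0]  # single dominant tap

for path in (flat_path, peaky_path):
    print(xi1_white(path), xi2_white(path, 1.0, 0.0))
```

With these two illustrative paths, $\xi^{(1)}$ differs between them even
though no near-end speech is present, which is why no natural threshold
exists for it, while $\xi^{(2)}$ is the same for both.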

Table 6.1 The fast version of the NCC double-talk detector. Note that we need
to distinguish between the "background" echo path estimate calculated in the
DTD, denoted $\mathbf{h}_b(n)$, and the estimate calculated in the echo
canceler, $\mathbf{h}(n)$. The a posteriori Kalman gain is
$\mathbf{k}'(n) = \mathbf{R}^{-1}(n)\mathbf{x}(n)$.

    Double-talk detector

    $\sigma_y^2(n) = \lambda\sigma_y^2(n-1) + y^2(n)$
    $e_b(n) = y(n) - \mathbf{h}_b^T(n-1)\mathbf{x}(n)$
    $\eta^2(n) = \lambda\eta^2(n-1) + y^2(n) - \varphi(n)e_b^2(n)$
    $\eta(n)/\sigma_y(n) < T \;\Rightarrow$ double-talk, $\mu = 0$
    $\eta(n)/\sigma_y(n) \geq T \;\Rightarrow$ no double-talk, $\mu = 1$
    $\mathbf{h}_b(n) = \mathbf{h}_b(n-1) + \mathbf{k}'(n)\,e_b(n)/\varphi(n)$

Finally, a fast version of (6.15) can be derived by recursively updating
$\mathbf{R}_{xx}^{-1}\mathbf{r}_{xy}$ using the Kalman gain [3]. Estimated
quantities of the cross-correlation and the near-end signal power have to be
introduced for the derivation of a fast version. Equation (6.15) can be
rewritten as

\[ \xi^2(n) = \frac{\mathbf{r}^T(n)\mathbf{R}^{-1}(n)\mathbf{r}(n)}{\sigma_y^2(n)}
            = \frac{\eta^2(n)}{\sigma_y^2(n)}, \tag{6.17} \]



where we squared the statistic for simplicity. The correlation variables are
estimated recursively as,

\[ \mathbf{r}(n) = \lambda\mathbf{r}(n-1) + \mathbf{x}(n)y(n), \tag{6.18} \]

\[ \mathbf{R}(n) = \lambda\mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \tag{6.19} \]

\[ \sigma_y^2(n) = \lambda\sigma_y^2(n-1) + y^2(n), \tag{6.20} \]

\[ \alpha(n) = \lambda + \mathbf{x}^T(n)\mathbf{R}^{-1}(n-1)\mathbf{x}(n), \tag{6.21} \]

where $\lambda < 1$ is a forgetting factor. The statistic $\eta^2(n)$ can be
shown to be updated as

\[ \eta^2(n) = \lambda\eta^2(n-1) + y^2(n) - \varphi(n)e^2(n), \tag{6.22} \]

where the likelihood variable $\varphi(n) = \lambda/\alpha(n)$ and $e(n)$ is
the residual error, $e(n) = y(n) - \hat{y}(n)$. Hence, the quantities required
to form the test statistic of the fast version of the NCC DTD are given by the
simple first-order recursions in (6.20) and (6.22).

Table 6.1 gives the calculations for the fast NCC DTD, where it is assumed
that the Kalman gain has been calculated "for free" by the FRLS algorithm [17].
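
The recursions (6.20)-(6.22) can be tried out with the following Python
sketch. It is not the FRLS-based implementation referred to above: as a
stand-in, the gain is computed with the standard $O(L^2)$ RLS recursion for
$\mathbf{R}^{-1}(n)$, the background filter uses the textbook RLS update with
the a priori error, and the initialization $\mathbf{R}(0) = \delta\mathbf{I}$,
the parameter values, and the names `fast_ncc_dtd` and `make_signals` are all
assumptions made for illustration.

```python
import math
import random


def fast_ncc_dtd(x, y, L=4, lam=0.99, delta=1e-2, T=0.9):
    """Return a list of (xi, double_talk) pairs, one per sample."""
    # P tracks R^{-1}(n); R(0) = delta * I is an assumed regularization.
    P = [[(1.0 / delta if i == j else 0.0) for j in range(L)] for i in range(L)]
    h_b = [0.0] * L   # background echo-path estimate
    xv = [0.0] * L    # x(n) = [x(n), ..., x(n-L+1)]
    sigma2 = 0.0      # sigma_y^2(n), eq. (6.20)
    eta2 = 0.0        # eta^2(n), eq. (6.22)
    out = []
    for xn, yn in zip(x, y):
        xv = [xn] + xv[:-1]
        Px = [sum(P[i][j] * xv[j] for j in range(L)) for i in range(L)]
        alpha = lam + sum(xv[i] * Px[i] for i in range(L))   # eq. (6.21)
        phi = lam / alpha                                    # likelihood variable
        e = yn - sum(h_b[i] * xv[i] for i in range(L))       # a priori error
        sigma2 = lam * sigma2 + yn * yn
        eta2 = lam * eta2 + yn * yn - phi * e * e
        xi = math.sqrt(max(eta2, 0.0) / sigma2) if sigma2 > 0.0 else 1.0
        out.append((xi, xi < T))
        k = [v / alpha for v in Px]                          # R^{-1}(n) x(n)
        h_b = [h_b[i] + k[i] * e for i in range(L)]
        P = [[(P[i][j] - k[i] * Px[j]) / lam for j in range(L)] for i in range(L)]
    return out


def make_signals(n, h, near_end_gain, seed=0):
    """White far-end signal, echo through h, plus scaled white near-end signal."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = []
    for i in range(n):
        echo = sum(h[j] * x[i - j] for j in range(len(h)) if i - j >= 0)
        y.append(echo + near_end_gain * rng.gauss(0.0, 1.0))
    return x, y
```

With `near_end_gain = 0` the statistic settles close to one; with a near-end
signal of roughly the same power as the echo it drops well below one, in line
with (6.15).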

3.4 THE COHERENCE METHOD 

Figure 6.2 Estimated coherence using the multiple window method. (a) Far-end
talker is active only. (b) Double-talk situation where the far- and near-end
signal powers are equal. Echo path attenuation is 6 dB and the ambient noise
power is 25 dB lower than the far-end speech.

Instead of using the cross-correlation vector, a detection statistic can be
formed by using the squared magnitude coherence. A DTD based on coherence
was proposed in [4]. The idea is to estimate the coherence between $x(n)$ and
$y(n)$. The coherence is close to one when there is no double-talk and it is
close to zero in a double-talk situation. Figure 6.2 shows an example of
estimated coherence between loudspeaker and microphone signals in the presence
and absence of double-talk. The squared coherence is defined as,

\[ \gamma_{xy}^2(k) = \frac{|S_{xy}(k)|^2}{S_{xx}(k)S_{yy}(k)}, \tag{6.23} \]



where $S_{\cdot\cdot}(k)$ is the DFT-based cross-power spectrum and $k$ is the
DFT frequency index. As the decision parameter, an average over a few
frequency intervals is used as the detection statistic,

\[ \xi = \frac{1}{I}\sum_{i=0}^{I-1}\gamma_{xy}^2(k_i), \tag{6.24} \]

where $I$ is the number of intervals used. A typical choice of these
parameters is $I = 3$, with $k_0$, $k_1$, $k_2$ the intervals chosen such that
their centers correspond to approximately 300, 1200, and 1800 Hz,
respectively. This gives in practice significantly better performance than
averaging over the whole frequency range, since there is a poorer
speech-to-noise ratio at the upper frequencies (the average speech spectrum
declines by about 6 dB/octave above 2 kHz).

Estimation of the spectra in (6.23) can be made by using the multiple window
technique [18], where



\[ S_{xx}(k) = \frac{1}{P}\sum_{p=0}^{P-1} X_p(k)X_p^*(k), \tag{6.25} \]

\[ S_{xy}(k) = \frac{1}{P}\sum_{p=0}^{P-1} X_p(k)Y_p^*(k), \tag{6.26} \]

where $X_p(k)$ is the $p$th eigenspectrum

\[ X_p(k) = \sum_{n=0}^{L_c-1} x(L_c-1-n)\,\phi_p(n)\,e^{-j2\pi nk/L_c}, \tag{6.27} \]

and $Y_p(k)$ is analogously defined. The window $\phi_p(n)$ is the $p$th
discrete spheroidal wave function [19]. $L_c$ is the block length of the DFT.
The multiple window method has advantages such as an easy tradeoff between
bias and variance. Another possibility is to use the Welch spectrum estimation
method [20].

Since this DTD is based on block processing of the signals, there is a
tradeoff between calculation complexity and time between decisions. It is
desirable to keep the time between decisions as short as possible in order to
have as few detection failures as possible (both false alarm and detection
miss).
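
The statistic in (6.23)-(6.24) can be illustrated with a small Python sketch.
Note the substitutions: instead of the multiple window technique, the spectra
are estimated by averaging Hann-windowed, non-overlapping blocks (a simplified
Welch-style estimate), each "interval" is represented by a single DFT bin, and
the block length, sampling rate, and function names are illustrative
assumptions rather than values from the text.

```python
import cmath
import math
import random


def dft_bin(block, k):
    """Single DFT coefficient X(k) of an already windowed block."""
    n_len = len(block)
    return sum(v * cmath.exp(-2j * math.pi * k * n / n_len)
               for n, v in enumerate(block))


def coherence_statistic(x, y, freqs_hz, fs=8000, block_len=64):
    """Average squared coherence between x and y over the given frequencies."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / block_len)
              for n in range(block_len)]
    n_blocks = len(x) // block_len
    total = 0.0
    for f in freqs_hz:
        k = round(f * block_len / fs)          # nearest DFT bin
        sxx = syy = 0.0
        sxy = 0j
        for p in range(n_blocks):
            seg = slice(p * block_len, (p + 1) * block_len)
            X = dft_bin([w * v for w, v in zip(window, x[seg])], k)
            Y = dft_bin([w * v for w, v in zip(window, y[seg])], k)
            sxx += abs(X) ** 2                 # accumulates S_xx(k)
            syy += abs(Y) ** 2                 # accumulates S_yy(k)
            sxy += X * Y.conjugate()           # accumulates S_xy(k)
        total += abs(sxy) ** 2 / (sxx * syy)   # eq. (6.23); 1/P factors cancel
    return total / len(freqs_hz)               # eq. (6.24)


rng = random.Random(1)
far = [rng.gauss(0.0, 1.0) for _ in range(2048)]
near = [rng.gauss(0.0, 1.0) for _ in range(2048)]
echo_only = [0.5 * v for v in far]
double_talk = [0.5 * a + 0.5 * b for a, b in zip(far, near)]
print(coherence_statistic(far, echo_only, [300, 1200, 1800]))
print(coherence_statistic(far, double_talk, [300, 1200, 1800]))
```

Averaging over more blocks lowers the variance of the estimate but lengthens
the time between decisions, which is the tradeoff noted above.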

3.5 THE NORMALIZED CROSS-CORRELATION MATRIX

Obviously, the cross-correlation and coherence methods are related in some
sense. This link can be established by extending the definition of the
cross-correlation method to incorporate correlation between two vectors
$\mathbf{x}$ and $\mathbf{y}$ instead of only the scalar $y(n)$ [5]. Define
the normalized cross-correlation matrix $\mathbf{C}_{xy}$ between two vectors
$\mathbf{x}$ and $\mathbf{y}$ as follows

\[ \mathbf{C}_{xy} = \mathbf{R}_{xx}^{-1/2}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1/2}, \tag{6.28} \]

where

\[ \mathbf{y}(n) = \left[\, y(n) \;\; y(n-1) \;\; \cdots \;\; y(n-N+1) \,\right]^T \]

is a vector of size $N$. There are two interesting cases:

(i) $N = 1$, $\mathbf{C}_{xy} = \mathbf{c}^{(2)}_{xy}$ (normalized
cross-correlation vector between $\mathbf{x}$ and $y$).

(ii) $N = L = 1$, $\mathbf{C}_{xy} = c^{(1)}_{xy,0}$ (cross-correlation
coefficient between $x$ and $y$).

By extension to (6.13), we then form the detection statistic

\[ \xi^{(3)} = \frac{1}{\sqrt{N}}\,\|\mathbf{C}_{xy}\|_F, \tag{6.29} \]

where the subscript "F" denotes the Frobenius norm. We note that for case (i),
$\xi^{(3)} = \xi^{(2)}$ as before. Again, we can interpret this formulation as
a "generalized cross-correlation", where now both $\mathbf{x}$ and
$\mathbf{y}$ are prewhitened, which is also known as the "smoothed coherence
transform" (SCOT) [15].

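The link between the normalized cross-correlation matrix and the coherence
can now be established as follows: Suppose that $N = L \to \infty$. In this
case, a Toeplitz matrix is asymptotically equivalent to a circulant matrix if
its elements are absolutely summable [21], which is the case for the intended
application. Hence we can decompose $\mathbf{R}_{ab}$ as

\[ \mathbf{R}_{ab} = \mathbf{F}^{-1}\mathbf{S}_{ab}\mathbf{F}, \tag{6.30} \]

where $\mathbf{F}$ is the discrete Fourier transform (DFT) matrix and

\[ \mathbf{S}_{ab} = \mathrm{diag}\{S_{ab}(0), S_{ab}(1), \cdots, S_{ab}(L-1)\} \tag{6.31} \]

is a diagonal matrix formed by the first column of
$\mathbf{F}\mathbf{R}_{ab}$, and

\[ S_{ab}(k) = \sum_{m=-\infty}^{+\infty} E\{a(n)b(n-m)\}\,e^{-j2\pi km/L} \tag{6.32} \]

is the DFT cross-power spectrum. Now:

\[ \mathrm{tr}(\mathbf{C}_{xy}^T\mathbf{C}_{xy})
   = \mathrm{tr}(\mathbf{R}_{yy}^{-1/2}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1/2})
   = \mathrm{tr}(\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}) \tag{6.33} \]

since $\mathrm{tr}(\mathbf{A}\mathbf{B}) = \mathrm{tr}(\mathbf{B}\mathbf{A})$.
Using (6.30), we easily find that

\[ \mathrm{tr}(\mathbf{C}_{xy}^T\mathbf{C}_{xy}) = \sum_{k=0}^{L-1}|\gamma_{xy}(k)|^2, \tag{6.34} \]

where

\[ \gamma_{xy}(k) = \frac{S_{xy}(k)}{\sqrt{S_{xx}(k)S_{yy}(k)}} \tag{6.35} \]

is the discrete coherence function. Thus, asymptotically we have

\[ \xi^{(3)} \approx \sqrt{\frac{1}{L}\sum_{k=0}^{L-1}|\gamma_{xy}(k)|^2}
   = \sqrt{\frac{1}{L}\sum_{k=0}^{L-1}\frac{|H(k)|^2}{|H(k)|^2 + \kappa(k)}}, \tag{6.36} \]

where $H(k)$ is the transfer function of $\mathbf{h}$ and

\[ \kappa(k) = \frac{S_{vv}(k)}{S_{xx}(k)} \geq 0 \tag{6.37} \]

is the near-end talker to far-end talker spectral ratio at frequency $k$.
Except for an unrestricted frequency range, this form is identical to the
coherence-based double-talk detector presented in Section 3.4. We find that
this idea is very appropriate since when $v(n) = 0$, the two signals $x$ and
$y$ are completely coherent and then $|\gamma_{xy}(k)| = 1, \forall k$, and
$\xi^{(3)} \approx 1$; when $v \neq 0$, $|\gamma_{xy}(k)| < 1, \forall k$, and
$\xi^{(3)} < 1$.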



3.6 THE TWO-PATH MODEL 



An interesting approach to double-talk handling was proposed in [13]. This 
method was introduced for network echo cancellation. However, it has proven 
far more useful for the AEC application. In this method, two filters model 
the echo path, one background filter which is adaptive as in a conventional 
AEC solution and one foreground filter which is not adaptive. The foreground 
filter cancels the echo. Whenever the background filter performs better than 
the foreground, its coefficients are copied to the foreground. Coefficients are 
copied only when a set of conditions are met, which should be compared to 
the single statistic decision declaring “no double-talk” in a traditional DTD 
presented in the previous sections. 

Figure 6.3 Two-path adaptive filtering. The adaptive filter estimates the
impulse response $\mathbf{h}_b$. If, according to some criteria,
$\mathbf{h}_b$ is determined to be a better estimate than the earlier estimate
$\mathbf{h}_f$, the coefficients in $\mathbf{h}_b$ are copied to
$\mathbf{h}_f$. The latter is then used to calculate the residual echo signal
$e_f(n)$.

The basic set of conditions found in [13] is given by (6.38)-(6.40). Copying
is performed, equivalent to no double-talk present, if all of (6.38)-(6.40)
are fulfilled,

\[ \frac{P(e_b)}{P(y)} < T_y, \tag{6.38} \]

\[ \frac{P(e_b)}{P(e_f)} < T_e, \tag{6.39} \]

\[ \frac{P(y)}{P(x)} < 1, \tag{6.40} \]

where $P(a)$ is the short-time smoothed absolute magnitude of a signal $a(n)$,

\[ P(a(n)) = \sum_{l=0}^{L_{TP}-1}|a(n-l)|. \tag{6.41} \]

A hangover time $T_{hold}$ is also imposed when (6.40) is fulfilled. The last
condition (6.40) is basically the same as in the Geigel DTD with a unity
threshold, i.e., the echo path is assumed not to attenuate the far-end speech.
If all three conditions are satisfied over $D$ consecutive decisions, copying
of background coefficients is resumed.
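
The copy decision can be sketched as follows in Python. This is a simplified
illustration, not the logic of [13] in full: the hangover time $T_{hold}$ is
omitted, the class name and interface are hypothetical, and the value used for
$T_e$ in the usage lines is an arbitrary placeholder (only the 18 dB figure
for $T_y$ appears in the text). The ratios are compared in product form to
avoid divisions by zero.

```python
from collections import deque


class TwoPathCopyLogic:
    """Decide when background coefficients may be copied to the foreground."""

    def __init__(self, window_len, t_y, t_e, d):
        self.buffers = {name: deque(maxlen=window_len)
                        for name in ("x", "y", "e_b", "e_f")}
        self.t_y, self.t_e, self.d = t_y, t_e, d
        self.run_length = 0  # consecutive decisions with all conditions met

    def update(self, x, y, e_b, e_f):
        """Feed one sample of each signal; return True if copying is allowed."""
        for name, value in (("x", x), ("y", y), ("e_b", e_b), ("e_f", e_f)):
            self.buffers[name].append(abs(value))
        p = {name: sum(buf) for name, buf in self.buffers.items()}  # eq. (6.41)
        ok = (p["e_b"] < self.t_y * p["y"]        # eq. (6.38)
              and p["e_b"] < self.t_e * p["e_f"]  # eq. (6.39)
              and p["y"] < p["x"])                # eq. (6.40)
        self.run_length = self.run_length + 1 if ok else 0
        return self.run_length >= self.d


# 18 dB of background cancellation expressed as a magnitude ratio.
logic = TwoPathCopyLogic(window_len=4, t_y=10 ** (-18 / 20), t_e=0.875, d=3)
print([logic.update(1.0, 0.5, 0.01, 0.4) for _ in range(3)])
```

Requiring `d` consecutive positive decisions trades responsiveness for
protection against short-term fluctuations of the smoothed magnitudes.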

Condition (6.38) ensures the background adaptive filter is canceling echo,
while condition (6.39) ensures the background filter is outperforming the
foreground filter. The above decision logic is effective for certain
applications, but is not without shortcomings. First, conditions (6.38) and
(6.39) are not always sufficient to prevent coefficient transfer in the
presence of double-talk and/or high background noise. For speech or any other
non-spectrally diverse excitation, the inequalities in (6.38) and (6.39) can
be satisfied in the short term (over duration $D$, for example) even though
the actual misalignment error of the background coefficients is worse than
that of the foreground coefficients. Second, (6.38) and (6.39) employ
thresholds that limit the responsiveness of the logic to changes in the
performance of the background canceler and to changes in the physical echo
path. Condition (6.38) requires the background canceler to achieve a certain
degree (18 dB in [13]) of cancellation before the foreground can be updated.
In the presence of an echo path change, for example, (6.38) can prolong the
presence of annoying echo. The threshold in (6.39) ensures that the
foreground is updated only in steps, in effect quantizing the convergence
trajectory of the echo canceler. Last, condition (6.40) ensures no update is
performed unless $|y(n)| < |x(n)|$. But this property can be used to inhibit
adaptation only in cases for which the physical echo path introduces signal
loss ($\mathbf{h}^T\mathbf{h} < 1$). If the echo path introduces gain
($\mathbf{h}^T\mathbf{h} > 1$), condition (6.40) prevents adaptation even in
the absence of near-side speech and noise. For this reason, these rules cannot
in general be used in echo-canceling speakerphones, where
$\mathbf{h}^T\mathbf{h} > 1$.

3.6.1 A Threshold-Free Decision Logic. In addition to the beneficial 
aspects of the original two-path logic, a two-path canceler decision logic should 
possess the following characteristics: 

■ Faster initial convergence and reconvergence following echo path 
changes. 

■ Applicability to echo paths having signal gain ($\mathbf{h}^T\mathbf{h} > 1$).

■ Reduced dependence upon user-selected constants, such as thresholds 
and timers. 

A logic that exhibits these properties is proposed and described in detail in
[22]. This decision logic differs from that of prior works in that it does not
use decision thresholds (constants). A smoothing parameter is the only
constant that has to be chosen. Moreover, the logic applies to both lossy and
gain-incurring echo paths, and possesses favorable convergence properties for
many scenarios encountered in practice. Hence, the great advantage of this
algorithm is that it is not sensitive to echo path changes, since the
background filter is allowed to track changes freely and, as soon as it
performs better than the foreground, it is copied over.

3.7 DTD COMBINATIONS WITH ROBUST STATISTICS

All practical double-talk detectors have a probability of miss, i.e.,
$P_m \neq 0$. Requiring the probability of miss to be smaller will undoubtedly
increase the probability of false alarms, hence slowing down the convergence
rate. As a consequence, no matter what DTD is used, undetected near-end speech
will perturb the adaptive algorithm from time to time. Figure 6.4 shows the
remaining undetected near-end speech (double-talk) after double-talk detection
with a Geigel detector with $T = 2$. The impact of this perturbation is
governed by the echo to near-end speech ratio as described in Section 1.

Figure 6.4 (a) Far-end speech. (b) Near-end speech, i.e., double-talk.
(c) Near-end speech gated with the decision of the DTD. These are the
disturbances that actually enter the adaptive algorithm. Average far- to
near-end ratio: 6 dB (1.125-3.625 s).



In practice, what has been done in the past is, first the DTD is designed to be 
“as good as” one can afford and then, the adaptive algorithm is slowed down so 
that it copes with the detection errors made by the DTD. This is natural to do 
since if the adaptive algorithm is very fast, it can react faster to situation changes 
(e.g. double-talk) than the DTD and thus can diverge. However, this approach 
severely penalizes the convergence rate of the AEC when the situation is good, 
i.e. far-end but no near-end talk is present. 

In the light of these facts, it may be fruitful to look at adaptive algorithms
that can handle at least a small amount of double-talk without diverging. This
approach has been studied and proven very successful in the network echo
canceler case [23], where the combination of outlier-resistant adaptive
algorithms and a Geigel DTD was studied. For the acoustic case, one could use
any appropriate DTD and combine it with a robust adaptive algorithm.

The approach can be exemplified by a robust version of the NLMS algorithm:

\[ \mathbf{h}(n) = \mathbf{h}(n-1)
   + \frac{\mu\,\mathbf{x}(n)}{\mathbf{x}^T(n)\mathbf{x}(n) + \delta}\,
     \psi\!\left[\frac{|e(n)|}{s(n-1)}\right]\mathrm{sign}[e(n)]\,s(n-1). \tag{6.42} \]

As for any AEC/DTD, adaptation is inhibited by setting the step-size parameter
$\mu$ to zero when double-talk is detected. The scaled non-linearity
$\psi(\cdot)$ in (6.42) can be chosen to be the limiter [24],

\[ \psi\!\left[\frac{|e(n)|}{s(n)}\right] = \min\left\{k_0, \frac{|e(n)|}{s(n)}\right\}, \tag{6.43} \]

where $k_0$ is the limit and $s(n)$ is an adaptive scale factor. Making the
scale factor adaptive and supervised by the DTD is the key to the success of
this approach. The scale factor should reflect the background noise level at
the near-end, be robust to short burst disturbances (double-talk) and track
long-term changes of the residual error (echo path changes). To fulfill these
requirements one can choose the scale factor estimate as


\[ s(n) = \lambda s(n-1) + \frac{1-\lambda}{\beta}\,s(n-1)\,
          \psi\!\left[\frac{|e(n)|}{s(n-1)}\right], \tag{6.44} \]

where $s(0) = \sigma_x$. Adaptation of $s(n)$ is performed as long as the DTD
has not detected double-talk. Justification and details of the above
derivations can be found in [23].
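
A single iteration of the robust update and the scale recursion might look as
follows in Python. This is a sketch under assumptions: the function name and
interface are hypothetical, the default values for $\mu$, $\delta$, the limit
$k_0$, the forgetting factor, and the constant $\beta$ are illustrative
placeholders that are not taken from this text, and double-talk simply freezes
both the filter and the scale factor.

```python
def robust_nlms_step(h, x, y, s, mu=0.5, delta=1e-6, k0=1.1,
                     lam=0.997, beta=0.60665, double_talk=False):
    """One robust NLMS iteration; returns (new_h, new_s, e).

    h : current filter coefficients (list)
    x : far-end vector [x(n), ..., x(n-L+1)]
    y : microphone sample y(n)
    s : previous scale factor s(n-1), must be > 0
    """
    e = y - sum(hi * xi for hi, xi in zip(h, x))    # a priori error
    if double_talk:                                 # DTD inhibits adaptation
        return list(h), s, e
    psi = min(k0, abs(e) / s)                       # limiter, eq. (6.43)
    sign = 1.0 if e >= 0.0 else -1.0
    gain = mu * psi * sign * s / (sum(xi * xi for xi in x) + delta)
    h_new = [hi + gain * xi for hi, xi in zip(h, x)]    # eq. (6.42)
    s_new = lam * s + (1.0 - lam) / beta * s * psi      # eq. (6.44)
    return h_new, s_new, e


# A small error is passed through unchanged; a large one is clipped to k0 * s.
print(robust_nlms_step([0.0, 0.0], [1.0, 0.0], 0.5, 1.0)[0])
print(robust_nlms_step([0.0, 0.0], [1.0, 0.0], 100.0, 1.0)[0])
```

The clipping bounds the coefficient change caused by an undetected burst of
near-end speech, while small errors are treated exactly as in ordinary NLMS.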



4. COMPARISON OF DTDS BY MEANS OF THE ROC 

In this section, we present receiver operating characteristics of three DTDs, 
namely, the Geigel detector, the cross-correlation detector, and the normalized 
cross-correlation detector. As reference, we also show the region of operation 
for the “threshold free” two-path logic described in Section 3.6. 

Estimates of Pm and Pf were obtained according to the procedure in [6] 
and we present the ROCs estimated from speech as well as stationary synthetic 
data. The speech data we use contain sentences from three male and two female 
talkers. Furthermore, all sentences/synthetic data are also normalized to have 
the same average power level. 

Simulation details are as follows: 

Echo path. The echo path used is a measured acoustic response between
the left loudspeaker and a standard cardioid microphone positioned on top
of a workstation. The original impulse response has a length of 256 ms,
consisting of 4096 coefficients at a 16 kHz sampling rate. However, it
was subsequently decimated to an 8 kHz sampling rate, resulting in 2048
coefficients. It is also normalized so that = a'^ for the actual speech 
and synthetic data. 

Probability of false alarm. When estimating the probability of false
alarm we use sentences from five talkers as far-end speech. These consist
of speech sequences of 21.8 seconds at an 8 kHz sampling rate. The
echo-to-ambient-noise ratio, ENR = $\sigma_{y_e}^2/\sigma_w^2$, is set to
1000 (30 dB).

Probability of miss. The probability of a miss is estimated using 5
seconds of far-end speech from one male talker. As near-end speech, 8
sentences are used, each about 2 seconds long. In this case, we investigate
the performance when the average echo-to-background ratio EBR =
$\sigma_{y_e}^2/(\sigma_v^2 + \sigma_w^2) = 1$ (0 dB), since it is natural to
assume equally strong talkers.

The simulation conditions for the case with synthetic data are equivalent to
those of the speech case. The synthetic sequences [far-end, near-end
(double-talk), and ambient noise] are all white Gaussian distributed and
mutually independent. The synthetic data enables us to assess the influence of
the nonstationarity of speech (e.g., the instantaneously varying EBR/ENR).

A hold time ($T_{hold}$) of 30 ms (240 samples) is used in all three
detectors. The thresholds of the detectors are chosen such that their
probability of false alarm is in the range of 0 to about 1. For these
thresholds, we then estimate the corresponding probability of detection (see
[6] for details). Since the two-path method is threshold free, we instead vary
the smoothing parameter over a range of practical values and estimate the
resulting probabilities. These probabilities are presented together with the
ROCs of the other detectors. The smoothing parameter (time constant) is varied
from 0.1 to 0.5 s in steps of 0.1 s.
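
The threshold sweep can be pictured with the following Python sketch. It is a
deliberately simplified stand-in for the estimation procedure of [6], which is
not reproduced here: it assumes that the detection statistic has already been
recorded separately for samples without and with double-talk, that the
detector declares double-talk when the statistic falls below $T$, and it
ignores the hold time. The function name and the numbers in the usage lines
are made up for illustration.

```python
def roc_points(xi_no_dt, xi_dt, thresholds):
    """Return (T, P_f, P_m) triples for a detector that declares
    double-talk whenever its statistic is below T.

    xi_no_dt : statistic values recorded while no double-talk was present
    xi_dt    : statistic values recorded during double-talk
    """
    points = []
    for T in thresholds:
        # False alarm: double-talk declared although none is present.
        p_f = sum(1 for v in xi_no_dt if v < T) / len(xi_no_dt)
        # Miss: double-talk present but not declared.
        p_m = sum(1 for v in xi_dt if v >= T) / len(xi_dt)
        points.append((T, p_f, p_m))
    return points


for T, p_f, p_m in roc_points([0.95, 0.99, 0.9, 0.8],
                              [0.5, 0.7, 0.85, 0.97],
                              [0.6, 0.9]):
    print(T, p_f, p_m)
```

Raising the threshold lowers the miss probability and raises the false alarm
probability, which is the tradeoff an ROC curve displays.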

Results from these simulations are shown in Fig. 6.5. These results are
consistent with those reported in [6] and, by the use of the ROC, it is
possible to set thresholds such that the DTDs are compared fairly. In general,
$P_m$ increases and $P_f$ decreases when we compare the ROC curves estimated
using speech versus the ROC curves estimated using synthetic signals. We also
find that the ROC of the Geigel detector is more sensitive to this signal
condition change than the ROCs of the other detectors. It is clear from these
results that the normalized cross-correlation detector has superior
performance compared to the two others. Also in this figure, the operation
region for the threshold-free two-path implementation is shown. The two-path
method takes into account whether or not it is beneficial to update the
adaptive algorithm for the specific data set. Hence, the estimated probability
of false alarm increases and this should be taken into account when
interpreting the results. The lowest probability of miss is attained when the
smoothing parameter (time constant) is 0.5 s.

Figure 6.5 Receiver operating characteristic (ROC) (a) based on speech,
(b) based on synthetic data. Double-talk detectors: Geigel (x),
cross-correlation (o), and normalized cross-correlation (•). Region of
operation for the threshold-free two-path logic is indicated by (*).
EBR = 0 dB and ENR = 30 dB.



5. DISCUSSION 

In this chapter, we have presented double-talk detection algorithms suitable 
for acoustic echo cancellation. Because of the often unknown attenuation and 
the continuously time-varying nature of acoustic echo paths, devising an appro- 
priate DTD is more challenging than in the network echo canceler case. There 
are basically two types of double-talk detectors. First, those which form their 
test statistics from estimated level or power of far-end, near-end including echo, 
or residual echo signals. Secondly, detectors that make their decisions from 
cross-correlation or coherence estimates of the same involved signals. In this 
group we also find detectors utilizing the estimate of the echo path since these 
estimates are derived through cross-correlation as well. Double-talk detectors 
based on cross-correlation techniques exhibit desirable properties needed for 
the acoustic case. Mainly, they have very low sensitivity to the attenuation of 
the echo path. However, a problem that has to be considered and designed for, 
is the longer response time which may result. This is due to the fact that good 
(low variance) test statistics need to be based on a large amount of data. 

The ideal double-talk detector should be insensitive to echo path variations,
have equal performance whether the echo canceler has converged or not, have a
quick response time, and be sensitive to low near-end speech levels. Moreover,
the DTD must not slow down the convergence rate of the AEC, which can result
either from erroneous decisions or from the introduction of delays. Some of
these properties can be characterized by the probabilities of detection and
false alarm.

Notes

1. Originally from [1]. This quotation was borrowed from [2].

Abstract A software application has been designed that runs a stereophonic
acoustic echo canceler natively under Windows operating systems on personal
computers: the WinEC. This is a major achievement since echo cancelers require
that the sound card's input and output signals are time-synchronous.
Synchronizing the audio streams is a great challenge in such an "asynchronous"
environment as the operating system of a PC. Furthermore, stereophonic echo
cancellation is significantly more complicated to handle than the monophonic
case because of computational complexity, nonuniqueness of solution, and
convergence problems. In this chapter we present the system design and the
core algorithms we use. This system has been evaluated in point-to-point as
well as multi-point communication scenarios. We regularly use the software for
teleconferencing in wideband stereo audio over commercial IP networks.

Keywords: Real-Time Implementation, Hands-Free, Stereophonic, Acoustic Echo
Canceler, VoIP

1. INTRODUCTION 

Real-time echo cancellation requires a significant amount of computational 
resources. From a computational point of view, real-time implementation has 
usually been realized using custom-designed very large scale integration (VLSI) 
circuits or digital signal processors (DSPs) [1]. These processors are specif- 
ically designed for signal processing tasks. They provide parallel processing 
of operations and optimized pipeline structures. However, since the
computational power of personal computers (PCs) has increased tremendously in
the last few years, it is possible to perform very demanding real-time signal
processing in this environment as well. Moreover, the PC environment permits
the use of high-level programming languages, like C++, without the
restrictions commonly imposed by DSPs, such as fixed-point arithmetic. The
resulting source code can be easily used for implementing new algorithms and
testing them in real-time without the need to port to special hardware.
Furthermore, modern PC processors have SIMD (single instruction, multiple
data) processing capabilities which can be used to optimize the program for
speed.

The objective of this chapter is to present a flexible echo-cancelling speak- 
erphone algorithm that runs natively under the operating system (OS) on a PC. 
The additional hardware needed to support hands-free communication on a PC 
is a full-duplex capable sound card and a network adaptor, like a modem or 
an ethernet card. Depending on the desired operation mode, a mono or stereo 
microphone and loudspeakers are needed. For all, off-the-shelf hardware can 
be used. 

This work was done when the authors were with Bell Labs, Lucent Tech- 
nologies in the year of 2000. The system has previously been presented in 
[2, 3, 4]. Many of the original underlying research results can also be found 
in [5]. The echo canceler implementation provides the capability of com- 
municating hands-free in single-channel mode (receive one and transmit one 
audio-stream), synthetic-stereo mode (receive two and transmit one stream), or 
full stereo mode (receive two and transmit two audio-streams). In the full stereo 
case, natural stereo is transmitted to the receiving side. In the synthetic case, 
synthesized stereo [6] or 3D-audio [7] is generated from the mono audio stream 
at an intermediate conference server. The bandwidth of the audio is 8 kHz. 
To accommodate different acoustic environments, the echo canceler can span
acoustic paths of lengths 32, 64, 128, or 256 ms.

Figure 7.1 Block diagram of a generic two-channel acoustic echo canceler.



1.1 SIGNAL MODEL 

A block diagram of a two-channel, point-to-point speech communication link
with one echo canceler is shown in Figure 7.1. We denote the signals picked up
by the microphones in the transmission room by $x_1(n)$, $x_2(n)$, and the
return signal picked up by one of the microphones in the receiving room by
$y(n)$. The receiving room signal is in general composed of echo, ambient
noise $w(n)$, and possibly receiving room speech $v(n)$. Hence, we have the
receiving room signal model: $y(n) = y_e(n) + v(n) + w(n)$, where
$y_e(n) = \sum_{p=1}^{2} h_p * x_p(n)$ is the echo, $*$ denotes convolution,
and $h_p$, $p = 1, 2$, are the receiving room echo paths.

2. SYSTEM DESCRIPTION 

Figure 7.2 shows a block diagram of the entire software architecture for the 
single-channel (mono) case running on Microsoft Windows OS. This system 
primarily consists of three components: the audio module, the echo cancellation 
module, and the network module. An overview of these modules follows. 

2.1 THE AUDIO MODULE

The audio module is an interface between our software and the Windows
DirectX interface [8]. DirectX provides a general interface between the
Windows OS and different sound card drivers. The Windows DirectX interface is
relatively well defined and stable. However, a significant concern is the
so-called device driver between this interface and the actual sound card
hardware. This driver is designed by the manufacturer of the sound card
hardware and it is difficult to predict how it interacts with the Windows OS.

The key problem encountered when implementing an echo canceler on a PC
is loss of synchronization of the audio streams. This causes instantaneous
delay variation in the actual echo path, and the canceler cannot track these
changes fast enough to achieve proper cancellation. There are two primary
problems that must be solved to achieve stable echo cancellation performance.
The consequences of, and solutions to, these problems are shown in the
following.

2.1.1 Stream synchronization problem caused by OS sample-rate con- 
version. On a multitasking OS, more than one application can generate sound 
at a time. Thus, the OS must mix the different audio streams together. Since 
the streams can only be mixed if they have a common sample rate, it is nec- 
essary to apply sample-rate conversion. To avoid loss of sound quality, the 
common sample rate is chosen to be the highest sample rate of the different 
sound streams. 

The problem for the sound card driver software, which realizes the
sample-rate conversion, is that it is usually restricted to the use of certain
block sizes. To implement an exact conversion between two sample rates, it is
often necessary to change the input or output block sizes with time, e.g.,
when converting from 44.1 kHz to our sample rate of 16 kHz. To simplify the
implementation, some manufacturers tend to omit or insert samples to keep the
block sizes constant. In this case synchronization between input and output
streams suffers from a constant phase drift, or jitter, which results in poor
echo cancellation.
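
The need for time-varying block sizes follows from simple arithmetic, which
the Python sketch below illustrates. It only counts samples and performs no
filtering or interpolation; the function name and the input block length of
512 samples are hypothetical choices for the example. Since
16000/44100 = 160/441, only input blocks whose length is a multiple of 441
map to a constant number of output samples.

```python
from fractions import Fraction


def output_block_sizes(in_rate, out_rate, in_block, n_blocks):
    """Number of output samples produced per input block for an exact
    rate conversion (sample counts only, no actual resampling)."""
    ratio = Fraction(out_rate, in_rate)      # e.g. 160/441
    sizes = []
    produced = 0
    for b in range(1, n_blocks + 1):
        # Total output samples due after b input blocks, rounded down.
        total = (b * in_block * ratio.numerator) // ratio.denominator
        sizes.append(total - produced)
        produced = total
    return sizes


print(output_block_sizes(44100, 16000, 441, 5))   # constant block size
print(output_block_sizes(44100, 16000, 512, 8))   # block size must vary
```

A driver that forces a constant output block size in the second case has to
drop or insert samples, which produces the drift described above.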

The best way to resolve this problem is to take complete control over sample- 
rate conversion by performing it at the application layer. To make sure the driver 
does not modify the signal, the application up-samples to the highest sample 
rate offered by the sound card. As such, no other sample rate can be chosen 
by the OS without incurring loss of audio quality, and the sound card does not 
need to apply sample-rate conversion to play and record the streams. 

2.1.2 Synchronization failure following temporary bursts of CPU uti- 
lization. The general flowchart of a block-oriented, full-duplex audio appli- 
cation consists of three boxes: a sound card read box, which queries the audio 
data from the sound interface; a signal processing box where echo cancellation 
is done; and a third box that writes the processed data to the sound interface. The 
first box usually holds the data flow until a new input data block is captured by 
the sound card. To be sure that the application runs stably even at nearly 100%
CPU usage, the minimum audio latency must not be shorter than the duration 
of one data block. Therefore, the third box writes the processed data exactly 
one block further than the play position at the time a new block is read by the 
sound-in box. Unfortunately, if the operating system reaches the CPU limit, the 
difference between read and write position in the buffers changes abruptly and
stays steady at a new value. This phenomenon occurs on many sound cards. To
combat this situation, a precise measurement of the current read and write position
must be available to correct the synchronization failure. Unfortunately, the
buffer read and write positions reported by the sound interface are inaccurate. 
In this implementation, the positions are queried often during regular operation 
and averaged to reduce the variance of the measurement when a failure happens. 

2.2 THE NETWORK MODULE 

The network module controls the data transfer between the two connected 
clients. Its main tasks are buffering the different network-side audio streams 
and compressing (audio/speech coding) the audio data, if desired. 

To keep the actual network interface as simple as possible and to avoid 
excessive overhead due to additional headers, the Windows Socket interface 
is used to transmit the data through the network as a user datagram protocol 
(UDP) packet. This interface deals with all protocol tasks and requires only a 
port number and the IP address of the receiving client. At the receiving client 
the different audio modes (sample rate, compression mode) are distinguished 
by the block size of the incoming data. To detect missing or repeated packets, 
a modulo 256 counter is added at the beginning of the actual audio data and is 
incremented each time a new packet is sent. 
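
The packet counter described above can be sketched as follows. The function names and the rule for telling a missing packet from a repeated or late one (a forward gap of less than half the counter range counts as missing) are assumptions made for illustration; the text only states that a modulo 256 counter precedes the audio data.

```python
def make_packet(counter, audio_bytes):
    """Prefix the audio payload with a one-byte modulo-256 counter."""
    return bytes([counter % 256]) + audio_bytes

def classify_packet(expected, packet):
    """Compare the received counter with the expected one.

    Returns (status, missing_count, next_expected).
    """
    received = packet[0]
    gap = (received - expected) % 256
    if gap == 0:
        return "ok", 0, (received + 1) % 256
    if gap < 128:
        # Counter jumped forward: 'gap' packets never arrived.
        return "missing", gap, (received + 1) % 256
    # Counter went backward: a repeated or late packet (assumed rule).
    return "repeated_or_late", 0, expected

print(classify_packet(254, make_packet(1, b"audio")))
```

The wrap-around from 255 to 0 is handled by the modulo arithmetic, so a counter value of 1 arriving when 254 was expected is reported as three missing packets.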

The socket buffer is basically a FIFO (first in, first out) buffer which acts as 
a cache between the clients. It deals with three tasks: synchronization of data 
blocks, adjustment to different block sizes, and correcting for buffer underflow 
because of network problems. The latter two points are effects of the network 
and cannot be predicted whereas the first one is caused by the client itself. Since 
two clients are in general not synchronized, there is a small offset between the 
sample rates of the sound cards. As a result, one client transmits more packets in 
a certain time period than the other which results in a permanent drift in the FIFO 
buffer causing over- or under-runs. In this implementation, we simply detect 
these errors in the buffer and delete extra packets or insert packets filled with 
zeros to clear or refill the buffer. It is clear that these corrections will introduce 
audible effects, but if the buffer length is relatively long and the sample rate offset
is not too high, corrections are rarely necessary. On the other hand, enlarging 
the buffer will raise the overall audio latency of the connection. Therefore, the 
buffer size presents a trade-off between audio latency and immunity against 
synchronization problems. A more sophisticated solution would be to apply 
sample-rate correction to the incoming audio stream. In this case, however, the 
sample-rate offset must be estimated and the sample-rate conversion algorithm 
must be capable of converting to arbitrary sample rates. 
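
A minimal sketch of the simple correction strategy (delete on overflow, insert zeros on underflow) is given below. The class name, the choice to delete the oldest block, and the counters are assumptions for illustration, not a description of the actual WinEC code.

```python
from collections import deque

class SocketFifo:
    """FIFO of audio blocks that repairs over- and under-runs."""

    def __init__(self, capacity, block_len):
        self.capacity = capacity      # maximum number of queued blocks
        self.block_len = block_len    # samples per block
        self.blocks = deque()
        self.dropped = 0              # blocks deleted on overflow
        self.inserted = 0             # zero blocks inserted on underflow

    def push(self, block):
        """Store a block from the network; on overflow delete the oldest."""
        if len(self.blocks) >= self.capacity:
            self.blocks.popleft()
            self.dropped += 1
        self.blocks.append(block)

    def pop(self):
        """Return the next block for playback, or zeros if the FIFO is empty."""
        if not self.blocks:
            self.inserted += 1
            return [0] * self.block_len
        return self.blocks.popleft()

fifo = SocketFifo(capacity=2, block_len=4)
silence = fifo.pop()                  # underflow: a zero block is played
for value in (1, 2, 3):
    fifo.push([value] * 4)            # third push overflows, block 1 is lost
first = fifo.pop()
```

A larger `capacity` makes both corrections rarer at the cost of additional latency, which is the trade-off discussed above.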
The application uses an audio compression algorithm [9] that provides compression
rates of 32, 24, or 16 kbit/s. Compression reduces the network load
and makes a communication link possible even through slow analog modems. 

2.3 THE ECHO CANCELER MODULE 

The core echo canceler consists of a robust two-channel frequency-domain 
adaptive algorithm, a pseudo-coherence-based double-talk detector, and a resid- 
ual echo suppression unit. The choice of this solution is based on the need for 
a low-complexity algorithm, as compared to a time-domain solution, as well 
as the need to handle the problem of slow convergence in the two-channel 
case. This could be achieved with a subband solution as in [10]; however, a
more complicated adaptive algorithm (fast recursive least squares) would be 
required. The frequency-domain solution has been shown to provide a rea- 
sonable trade-off between complexity and performance [11, 5]. The specific 
problem of stereophonic echo cancellation, i.e., the nonuniqueness problem, 
is handled by nonlinear distortion as described in [12]. Details regarding the 
adaptive algorithm and double-talk detector are presented in the next sections. 
The suppression algorithm attenuates the residual echo e(n) (Fig. 7.1) depend- 
ing on the actual amount of echo cancellation. The adaptation of the attenuation 
is based on voice activity detection decisions and an echo-return-loss measurement.
All the measurements are based on the envelopes of the speech signals
and the noise. 

If the dynamic range of the received signal is near full range, unmodeled 
nonlinearities in the echo path will likely be excited. Therefore, the amplitude
of the incoming far-end signal is compressed.

Since the analog input signal for the sound card must be amplified with 
analog circuits prior to analog-to-digital (A/D) conversion, it is most likely that 
the captured signal carries an unwanted DC component. This DC offset can 
lead to performance degradation or even to instability of the adaptive algorithm 
and must therefore be removed with a high-pass filter. 
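
The text does not specify which high-pass filter is used. As an illustration only, a common first-order DC-blocking filter, y(n) = x(n) - x(n-1) + r y(n-1), is sketched below; the function name and the pole radius r = 0.995 are assumptions.

```python
def remove_dc(samples, r=0.995):
    """First-order DC blocker: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# A constant offset decays to zero, while an alternating signal passes.
offset_only = remove_dc([1.0] * 5000)
alternating = remove_dc([0.5 + (-1.0) ** n for n in range(5000)])
print(abs(offset_only[-1]), abs(alternating[-1]))
```

Moving r closer to 1 narrows the notch at DC but lengthens the settling time after a change in offset.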

3. ALGORITHMS OF THE ECHO CANCELER
MODULE

In this section, the robust two-channel frequency-domain adaptive algorithm 
and a frequency-domain-based double-talk detector are presented [5]. Further- 
more, a sub-band noise and residual echo suppression structure is outlined. 

For the stereo case we have two incoming transmission room signals,
$x_p(n)$, $p = 1, 2$, where the input excitation vectors are defined as

$$\mathbf{x}_p(n) = [\, x_p(n) \ \ x_p(n-1) \ \cdots \ x_p(n-L+1) \,]^T, \quad p = 1, 2.$$
The error signal at time $n$ between one arbitrary microphone output $y(n)$ in the
receiving room and its estimate is

$$e(n) = y(n) - \sum_{p=1}^{2} \hat{y}_p(n) = y(n) - \sum_{p=1}^{2} \hat{\mathbf{h}}_p^T \mathbf{x}_p(n),$$

where

$$\hat{\mathbf{h}}_p = [\, \hat{h}_{p,0} \ \cdots \ \hat{h}_{p,L-1} \,]^T, \quad p = 1, 2.$$



In general, we have two microphones in the receiving room, i.e. two return 
channels, and thus four adaptive filters. For simplicity in the derivation, only 
one return channel is considered; the derivation is similar for the other channel. 
Furthermore, the following block signals are defined: 



$$\mathbf{e}(m) = [\, e(mL) \ \cdots \ e(mL+L-1) \,]^T, \tag{7.1}$$

$$\mathbf{y}(m) = [\, y(mL) \ \cdots \ y(mL+L-1) \,]^T, \tag{7.2}$$



where the block length is chosen to be equal to the length of the adaptive filter 
L and m is the block time index. 

3.1 ADAPTIVE FILTER ALGORITHM

As with time-domain adaptive filter algorithms, the derivation begins with 
forming a criterion that is minimized with respect to the filter coefficients. Most 
commonly, the choice is a quadratic criterion that corresponds to a maximum 
likelihood estimator when the underlying noise distribution is Gaussian. Here, 
a maximum likelihood criterion derived from a non-Gaussian noise assumption 
is used. Modeling the noise with a probability density function (PDF) having 
a tail that is heavier than the Gaussian PDF gives a non-quadratic function to 
minimize, which results in an outlier-robust algorithm [13]. The following 
criterion is used: 



$$J(\hat{\mathbf{h}}) = \sum_{n=mL}^{mL+L-1} \rho\left[ \frac{|e(n)|}{s} \right], \tag{7.3}$$

where $\rho[\cdot]$ is a convex function and $s$ is a real positive scale factor as discussed
below.
Minimizing (7.3) yields the robust two-channel frequency-domain adaptive
algorithm [5, 14]:
where $\mathbf{F}$ is the Fourier matrix and
$$\mu' = \mu(1 - \lambda), \quad (\mu, \lambda, \lambda_s) \in [0, 1]. \tag{7.12}$$



3.1.1 Double-talk Detection. A frequency-domain computation scheme
for the normalized cross-correlation double-talk test statistic is presented in [5].
The decision variable is given by

$$\xi = \sqrt{\mathbf{s}^H (\sigma_y^2 \mathbf{S})^{-1} \mathbf{s}} = \left\| (\sigma_y^2 \mathbf{S})^{-1/2} \mathbf{s} \right\|_2 = \| \mathbf{c}_{xy} \|_2, \tag{7.13}$$

where

$$\mathbf{S} = E\left[ \mathbf{D}^H(m) \mathbf{G}^{01} \mathbf{D}(m) \right], \tag{7.14}$$

is the spectral matrix of the transmission room signal, and $\mathbf{s} = E\left[ \mathbf{D}^H(m) \mathbf{y}(m) \right]$
is a cross-spectral vector between the transmission room and receiving room
signals. The vector $\mathbf{c}_{xy}$ is referred to as the pseudo-coherence vector [15].
Looking at (7.13), each cross-spectrum bin of $\mathbf{s}$ is normalized by the corresponding
spectrum of the input signal, $x$. What differentiates (7.13) from being
the true coherence is that it is not normalized by the corresponding spectrum of
the output signal $y$ but by the total power of the output signal, $\sigma_y^2$. A practical
double-talk detection statistic can now be realized by using estimated quantities
in (7.13) and slightly rewriting the numerator.






$$\xi^2(m) = \frac{\hat{\mathbf{s}}^H(m) \hat{\mathbf{S}}^{-1}(m) \hat{\mathbf{s}}(m)}{\hat{\sigma}_y^2(m)} = \frac{\hat{\mathbf{s}}^H(m) \hat{\mathbf{h}}_b(m)}{\hat{\sigma}_y^2(m)}, \tag{7.15}$$

where the square of the statistic is used for simplicity, and $\hat{\mathbf{h}}_b(m) =
\hat{\mathbf{S}}^{-1}(m) \hat{\mathbf{s}}(m)$ is defined as an equivalent “background” filter. The decision
variable is obtained by using a separate filter for the DTD (not to be confused
with the “foreground” estimate of the echo canceler). The nomenclature “background”
and “foreground” is borrowed from a double-talk method called the
two-path method [16]. In some literature, the background filter is called the
shadow filter [17]. Estimates of the quantities in (7.15) are given by

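$$\hat{\mathbf{s}}(m) = \lambda_b \hat{\mathbf{s}}(m-1) + (1 - \lambda_b) \mathbf{D}^H(m) \mathbf{y}(m), \tag{7.16}$$

$$\hat{\mathbf{S}}(m) = \lambda_b \hat{\mathbf{S}}(m-1) + (1 - \lambda_b) \mathbf{D}^H(m) \mathbf{G}^{01} \mathbf{D}(m), \tag{7.17}$$

$$\hat{\sigma}_y^2(m) = \lambda_b \hat{\sigma}_y^2(m-1) + (1 - \lambda_b) \mathbf{y}^H(m) \mathbf{y}(m). \tag{7.18}$$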
As a simplification, the background echo path estimate $\hat{\mathbf{h}}_b(m)$ can be
computed adaptively as

$$\hat{\mathbf{h}}_b(m) = \hat{\mathbf{h}}_b(m-1) + \mu_b \mathbf{K}(m) \mathbf{e}_b(m), \tag{7.19}$$

$$\mu_b = (1 - \lambda_b), \tag{7.20}$$

where S^(m) is defined in (7.5) and 

$$\mathbf{e}_b(m) = \mathbf{y}(m) - \mathbf{G}^{01} \mathbf{D}(m) \hat{\mathbf{h}}_b(m-1). \tag{7.21}$$

The background estimate should be adapted with a smaller forgetting factor, $\lambda_b$,
than that of the foreground filter, $\lambda$. By this choice, the DTD detects double-talk
fast and alerts the foreground filter before it diverges. Table 7.1 summarizes the 
robust two-channel frequency-domain adaptive filter combined with the DTD. 
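
The behavior of the decision variable can be illustrated with a heavily simplified sketch. It assumes a single channel, a diagonal excitation matrix (one complex value per frequency bin), no linear-convolution constraint (the constraint matrix is treated as identity), a receiving-room signal already expressed in the frequency domain, and a background filter that has converged to the true echo path. The function names, the forgetting factor of 0.9, and the synthetic Gaussian data are all assumptions; only the recursions for the cross-spectral vector and the output power, and the threshold T = 0.92 from the simulation settings, come from the chapter.

```python
import math
import random

def dtd_statistic(blocks, h_b, lam_b=0.9):
    """Run the recursive estimates over all blocks and return xi.

    blocks: list of (D, y) pairs, each a list of complex frequency bins.
    h_b:    background filter, one complex coefficient per bin
            (assumed already converged in this sketch).
    """
    s = [0j] * len(h_b)       # cross-spectral estimate, cf. (7.16)
    sigma2 = 0.0              # output power estimate, cf. (7.18)
    for D, y in blocks:
        s = [lam_b * sk + (1.0 - lam_b) * dk.conjugate() * yk
             for sk, dk, yk in zip(s, D, y)]
        power = sum(abs(yk) ** 2 for yk in y)
        sigma2 = lam_b * sigma2 + (1.0 - lam_b) * power
    # Numerator of (7.15): s^H h_b (real part taken for robustness).
    numerator = sum(sk.conjugate() * hk for sk, hk in zip(s, h_b)).real
    return math.sqrt(max(numerator, 0.0) / sigma2)

def cgauss(rng):
    """Complex Gaussian sample used as synthetic test data."""
    return complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))

rng = random.Random(1)
bins, n_blocks = 64, 200
h = [cgauss(rng) for _ in range(bins)]
far_end = [[cgauss(rng) for _ in range(bins)] for _ in range(n_blocks)]

echo_only = [(D, [dk * hk for dk, hk in zip(D, h)]) for D in far_end]
double_talk = [(D, [dk * hk + 2.0 * cgauss(rng) for dk, hk in zip(D, h)])
               for D in far_end]

T = 0.92
xi_single = dtd_statistic(echo_only, h)
xi_double = dtd_statistic(double_talk, h)
print(xi_single >= T, xi_double < T)
```

With echo only, the numerator and the output power estimate coincide, so the statistic stays at one. Uncorrelated near-end energy raises the output power without raising the numerator, which pushes the statistic below the threshold and freezes the foreground adaptation.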

4. RESIDUAL ECHO AND NOISE SUPPRESSION 

The adaptive filter is the key component in echo cancellation. However, in 
the acoustic application, a linear echo canceler is often inadequate to provide 
sufficient cancellation such that the residual echo is inaudible. This is partic- 
ularly so in cases where there is a nonlinear imperfection in the echo path or 
the echo path changes because of motion in the acoustic path. Unfortunately, 
suppression impairs the “duplexness” (near-end speech transparentness) of the 
echo cancellation system. The problem presented by the use of an echo suppres- 
sor is how to trade off duplexness for satisfactory performance. Furthermore, 
it is often advantageous to improve the perceived quality by reducing the am- 
bient noise with a noise suppressor before transmitting the microphone signal. 
These two functions can be implemented jointly. The system of a (mono) echo 
canceler, residual echo and noise suppressor is shown in Figure 7.3. The residual
echo and noise suppressor is denoted by the time-varying system function
$g(n)$. Finally, artifacts may be introduced by the suppression methods so it
is customary to mask these distortions by adding “comfort noise,” $\omega(n)$. The
combined suppression and comfort noise injection operation can be described
as

$$e_g(n) = g(n) * e(n) + \omega(n). \tag{7.22}$$

In general, $g(n)$ can be a time-varying filter function of arbitrary order. Here,
subband (or frequency-domain) implementations are used to allow independent
processing of each frequency band as described in [18]. This choice casts the
suppressor as a scalar attenuator, $G$, in each frequency subband.$^2$

The objective in this section is to present the achievable cancellation performance
of the adaptive algorithm of the AEC (the results are also applicable to
the algorithm presented in the previous section as well as any general gradient- 
based adaptive algorithm). Based on this result, relations are derived between 
Table 7.1 The two-channel PC double-talk detector and echo canceler. The parameter $\varrho$ bounds
the maximum allowed coherence between the channels.



Spectral estimation 

Sp, <,("!) = ASp - 1) 

+ (1 - A) D"(m)Dg(m), = 1,2 

Sp,p(m) = Sp p(m) + diagj^p^o • • ■ ^p,2L-t}, P = 1, 2 

|r^(m)l = Si,i(m)S2,2(m) S2_i(m)Si_2(m) 

Sp(m) = Sp,p(m)x 

[hLx2L - Q^\T^{rn)\] , p,q= 1,2 
Ki(m) = S^’(m) D|(m) — ^S^_2(?"^)S2,2("0 

K2(m) = S^^(m) D2(m) — pS2j(m)Si J(m) Dj(m) 

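Double-talk detector (Background filter)

$$\mathbf{e}_b(m) = \mathbf{y}(m) - \mathbf{G}^{01} \mathbf{D}(m) \hat{\mathbf{h}}_b(m-1)$$

$$\hat{\mathbf{h}}_{b,p}(m) = \hat{\mathbf{h}}_{b,p}(m-1) + (1 - \lambda_b) \mathbf{K}_p(m) \mathbf{e}_b(m), \quad p = 1, 2$$

$$\hat{\mathbf{s}}(m) = \lambda_b \hat{\mathbf{s}}(m-1) + (1 - \lambda_b) \mathbf{D}^H(m) \mathbf{y}(m)$$

$$\eta^2(m) = [\, \hat{\mathbf{h}}_{b,1}^H(m) \ \ \hat{\mathbf{h}}_{b,2}^H(m) \,] \, \hat{\mathbf{s}}(m)$$

$$\hat{\sigma}_y^2(m) = \lambda_b \hat{\sigma}_y^2(m-1) + (1 - \lambda_b) \mathbf{y}^H(m) \mathbf{y}(m)$$

$$\xi(m) = \eta(m) / \hat{\sigma}_y(m) < T \ \Rightarrow \ \mu' = 0$$

$$\xi(m) = \eta(m) / \hat{\sigma}_y(m) \geq T \ \Rightarrow \ \mu' = \mu(1 - \lambda)$$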



Echo canceler (Foreground filter) 

e(m) = y{m) — W°’F“*D(m)h(77i — 1) 
hp(m) = hp(m - 1) + ^~^Kp(m)Ft/» [e(m)] , p= 1,2 

^min 



s{m + 1) 



, s mL-|-T— 1 

As.s(rn) + (1 - As) ^ 'ip 

n= 7 TiL 



>(^)r 

s(m) 



adaptive algorithm parameters, and the residual echo suppression required us- 
ing perceptual knowledge of the human auditory system. An outline of a joint 
noise and residual echo suppression is then given. 
Figure 7.3 Block diagram of the generic (mono) echo canceler, echo path, and noise/residual
echo suppression filter.



4.1 MASKING THRESHOLD FOR RESIDUAL ECHO 
IN NOISE 

Guiding the design of any speakerphone algorithm is the necessity to atten- 
uate the residual echo to make it inaudible under the present noise situation. To 
address this requirement, an understanding of the masking effects of speech in 
noise is needed. Unfortunately, there is no rigorous investigation for this specific
case. However, cases in which noise masks a tone or vice versa have been thoroughly
investigated [19, 20]. For the relevant case of a pure tone in narrow-band
noise (equal to or smaller than one critical bandwidth), the tone becomes inaudible
if it is 3 dB or more below the noise level [19]. Speech or residual echo
(filtered speech) is highly audible at this speech-to-noise ratio when the average
power level of speech is considered. However, speech compared to tones has a
much higher peak-to-standard deviation ratio, i.e., the crest-factor of speech, $C_s$,
is larger than that of tones. Denoting the variance of speech by $\sigma_s^2$, if instead
of using the average speech power one compares the “equivalent tone power of
speech” (ETP)

$$\mathrm{ETP} = \frac{C_s^2 \sigma_s^2}{2} \tag{7.23}$$

with the noise variance $\sigma_w^2$, the residual echo (speech) should be inaudible if

$$\frac{\mathrm{ETP}}{\sigma_w^2} < 10^{-3/10},$$

or

$$\frac{\sigma_s^2}{\sigma_w^2} < \frac{2 \cdot 10^{-3/10}}{C_s^2}. \tag{7.24}$$



Clean speech has a crest-factor of around 20 [21] but the authors’ experience 
is that residual echo has a lower value around 10. This means that (7.24) is 
in the range -20 dB ($C_s = 10$) to -26 dB ($C_s = 20$). We have performed
listening experiments with residual echo in various levels of white noise that 
have confirmed this approximate required ratio (-20 dB) for inaudibility. 
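
The quoted range can be checked numerically. The sketch below assumes that the equivalent tone is a sinusoid whose peak equals the speech peak (crest factor times standard deviation), so that its power is half the squared peak, and that the tone must lie 3 dB below the noise; the function name is illustrative.

```python
import math

def inaudibility_threshold_db(crest_factor, tone_margin_db=-3.0):
    """Speech-to-noise ratio (dB) below which residual echo is masked.

    Assumes the equivalent tone power is (C_s * sigma_s)**2 / 2 and that
    a tone is masked when it is tone_margin_db relative to the noise.
    """
    tone_margin = 10.0 ** (tone_margin_db / 10.0)
    ratio = 2.0 * tone_margin / crest_factor ** 2
    return 10.0 * math.log10(ratio)

for c_s in (10, 20):
    print(c_s, round(inaudibility_threshold_db(c_s), 1))
```

Under these assumptions the factor of two from the tone power nearly cancels the 3 dB margin, so the threshold is close to the inverse squared crest factor, which reproduces the -20 dB and -26 dB values given above.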



4.2 ANALYSIS OF ECHO SUPPRESSION 
REQUIREMENTS 

In this section, the performance and requirements of the echo canceler and 
residual echo suppressor are quantified. Assume that the adaptive algorithm of 
the echo canceler has the generic update equation 

$$\hat{\mathbf{h}}(m) = \hat{\mathbf{h}}(m-1) - 2\mu(1 - \lambda) \Delta\hat{\mathbf{h}}(m), \tag{7.25}$$

where $\hat{\mathbf{h}}$ is the filter vector of length $L$ and $\Delta\hat{\mathbf{h}}(m)$ is the gradient of the criterion
that we seek to minimize. The performance, i.e. accuracy, of the estimate is
solely determined by the step-size factor $2\mu(1 - \lambda)$ and the echo-to-background
noise ratio (EBR). Analyses show [5, Chapt. 9] that the performance measured
as excess MSE is given by



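$$\text{ex. MSE} = \frac{\sigma_{e-w}^2}{\sigma_{y-w}^2} \approx \frac{\mu}{K_0 \cdot \mathrm{EBR}}, \tag{7.26}$$

where $\sigma_{e-w}^2$ is the residual echo and $\sigma_{y-w}^2$ is the (uncancelled) echo. The
parameter $K_0 > 0$ is related to forgetting factor $\lambda$ as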



A = 




(7.27) 



Equation (7.26) is also valid for the traditional normalized least-mean-square
algorithm (NLMS) [22] if $K_0 = 2$ [23]. Hence, after initial convergence,
the performance of most echo cancelers can be quantified by the parameters
$\mu \in (0, 1]$ and $K_0 > 1$.
The ex. MSE only assesses the cancellation performance and not the audible
residual echo (perceived performance). The perceived performance is quantified
by the residual echo-to-noise ratio (RENR). The relation between the
step-size parameters and RENR is found using the ex. MSE as

$$\text{ex. MSE} = \frac{\sigma_{e-w}^2}{\sigma_{y-w}^2} \approx \frac{\mu}{K_0 \cdot \mathrm{EBR}}. \tag{7.28}$$

Reordering (7.28) gives

$$\mathrm{RENR} = \frac{\sigma_{e-w}^2}{\sigma_w^2} \approx \frac{\mu}{K_0}. \tag{7.29}$$



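It is also worthwhile to note that the RENR can be approximately related to
EBR and the misalignment (MIS) of the adaptive filter

$$\mathrm{RENR} \approx \mathrm{EBR} \cdot \frac{\| \mathbf{h} - \hat{\mathbf{h}} \|^2}{\| \mathbf{h} \|^2} = \mathrm{EBR} \cdot \mathrm{MIS}. \tag{7.30}$$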



The overall perceived performance of the AEC and suppression system is
given by the total residual echo-to-noise ratio (RENR$_0$). To capture this quantity
in a simple expression, the suppression function $g(n)$ in Fig. 7.3 is constrained
to be a scalar, $G$. Also, assume that the echo and noise have constant
spectra over frequency. After suppression, the desired total noise variance is
equal to a factor $G_w^2 < 1$ of the original noise variance. The natural noise
is attenuated by $G^2$ and the amount of comfort noise added is controlled by a
factor $\gamma_w^2 > 0$. This leads to

$$G^2 \sigma_w^2 + \sigma_\omega^2 = G_w^2 \gamma_w^2 \sigma_w^2, \tag{7.31}$$



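and the overall residual echo-to-noise ratio after suppression and addition of
comfort noise, RENR$_0$, then becomes

$$\mathrm{RENR}_0 = \frac{G^2 \sigma_{e-w}^2}{G^2 \sigma_w^2 + \sigma_\omega^2} = \frac{G^2 \sigma_{e-w}^2}{G_w^2 \gamma_w^2 \sigma_w^2} = \mathrm{RENR} \, \frac{G_{ex}^2}{\gamma_w^2}, \tag{7.32}$$

where $G_{ex} = G/G_w$ is the extra residual echo suppression. Reordering (7.32)
results in

$$G_{ex}^2 = \gamma_w^2 \, \frac{\mathrm{RENR}_0}{\mathrm{RENR}}. \tag{7.33}$$

To summarize: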



■ The required suppression ($G_{ex}$) to achieve inaudible residual echo is
given by (7.33).

■ RENR after convergence is given by the setting of the echo canceler
parameters (7.29) or in general (when the AEC has not converged) by (7.30).

■ RENR$_0$ is given by psycho-acoustic considerations (7.24) and experimental
results.

■ $\gamma_w$ defines the final noise level in the system.

For practical convergence rates, i.e. for practical choices of $\mu$ and $K_0$, RENR
is approximately -10 dB. However, in (7.30), RENR may become larger than
EBR (normally 10 to 20 dB) when a severe echo path change occurs (MIS $\approx$
1 to 2 in this case). Under this condition, with $\gamma_w^2 = 1$ and RENR$_0 = -26$
dB [from (7.24)], the residual echo attenuation required is

$$G_{ex}^2 \ \text{(in dB)} = 10 \log_{10}\left( \gamma_w^2 \, \frac{\mathrm{RENR}_0}{\mathrm{RENR}} \right) \approx -26 - 20 = -46 \ \text{dB}. \tag{7.34}$$
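
The arithmetic behind the required attenuation can be sketched in a few lines. The code reads (7.33) as the extra suppression (squared) being the comfort-noise factor (squared) times the ratio of target to actual residual echo-to-noise ratio, which is an interpretation of the printed equation; the function and argument names are illustrative.

```python
import math

def required_extra_suppression_db(renr0_db, renr_db, gamma_w=1.0):
    """Extra residual echo suppression G_ex^2 in dB, following (7.33).

    renr0_db: target residual echo-to-noise ratio after suppression (dB)
    renr_db:  residual echo-to-noise ratio delivered by the canceler (dB)
    gamma_w:  comfort-noise factor (1.0 keeps the noise level unchanged)
    """
    return 20.0 * math.log10(gamma_w) + renr0_db - renr_db

after_path_change = required_extra_suppression_db(-26.0, 20.0)
after_convergence = required_extra_suppression_db(-26.0, -10.0)
print(after_path_change, after_convergence)
```

The first call corresponds to the severe echo path change discussed above (RENR of about 20 dB). The second uses the converged value of about -10 dB and shows how much less suppression is needed once the adaptive filter has settled.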

4.3 NOISE AND RESIDUAL ECHO SUPPRESSION 

Based on the ideas above, residual echo and noise suppression algorithms 
are implemented in a subband structure. Voice activity detection is performed
in each subband of the echo and estimated echo signals. Depending on voice 
activity, statistics from these estimates are used for computation of a gain func- 
tion which is then applied to the residual echo. The algorithm distinguishes 
between situations of far-end speech only and near-end speech only. In the 
case of double-talk the residual echo (and near-end speech) is attenuated using 
a Wiener filter structure. 

5. SIMULATIONS 

For these two-channel simulations, two-channel speech recordings were
used [10, 24]. The sampling rate is 16 kHz. For transmission room speech,
stereo recordings from a male talker are used. The transmission room speech is
pre-processed with a nonlinearity before being emitted in the receiving room.
By doing so, the echo canceler converges to a stable, unique solution. The
nonlinearity used is [12]:

$$x_1' = x_1 + \frac{\alpha}{2} \left[ x_1 + |x_1| \right],$$

$$x_2' = x_2 + \frac{\alpha}{2} \left[ x_2 - |x_2| \right],$$

where the subscript 1 or 2 denotes either left or right channel, respectively. Thus,
the positive half-wave is added to the left channel and the negative to the right.
The distortion parameter of the nonlinearity is $\alpha$. Adaptive filter parameters
are $L = 1024$ (64 ms), $\lambda = [1 - 1/(3L)]^{L/o}$, $o = 4$ ($o$ is the overlapping
factor), $\mu = 0.5$, $k_0 = 1.5$, $\hat{\mathbf{h}}(0) = \mathbf{0}$, $\varrho = 0.99$ (relaxation parameter, see
Table 7.1). DTD parameters are $\lambda_b = [1 - 2/(3L)]^{L/o}$, ENR = 1000 (30 dB),
$T = 0.92$. Table 7.1 presents the modified system of two-channel PC DTD and
frequency-domain algorithm used in the simulations.

Figure 7.4 Effects of double-talk in terms of mean-square error of the two-channel robust
frequency-domain algorithm with the pseudo-coherence-based DTD. (a) Left-channel transmission
room speech (upper), receiving room speech (lower). (b) Results when double-talk is present
between 4.3 and 6.2 s. The rectangles in the bottom of the figures indicate where double-talk
has been detected.
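
The half-wave nonlinearity applied to the transmission room signals can be sketched as follows. The function name and the default distortion parameter of 0.5 are assumptions for illustration; the value used in the simulations is not legible in the source.

```python
def stereo_nonlinearity(x1, x2, alpha=0.5):
    """Add the positive half-wave to the left channel (x1) and the
    negative half-wave to the right channel (x2).

    alpha is the distortion parameter; 0.5 is an illustrative default.
    """
    y1 = [x + 0.5 * alpha * (x + abs(x)) for x in x1]
    y2 = [x + 0.5 * alpha * (x - abs(x)) for x in x2]
    return y1, y2

left, right = stereo_nonlinearity([1.0, -1.0], [1.0, -1.0], alpha=0.5)
print(left, right)
```

Positive samples of the left channel and negative samples of the right channel are scaled by one plus the distortion parameter, while the other half-waves pass unchanged. This reduces the coherence between the two channels, which is what allows the canceler to reach a unique solution.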

Figures 7.4 and 7.5 show the behavior of the system during double-talk and 
after an echo path change by means of its mean-square error (MSE). The mean- 
square error is defined as 



$$\mathrm{MSE}(n) = \frac{\mathrm{LPF}\{ [e(n) - u(n)]^2 \}}{\mathrm{LPF}\{ [y(n) - u(n)]^2 \}}.$$
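
A minimal sketch of such a smoothed, normalized error measure follows. The source does not state which low-pass filter is used, so a one-pole smoother with coefficient 0.999 is assumed here; the third signal (printed as u(n) in the definition and not defined nearby) is simply passed in as a sequence that is subtracted from both the error and the microphone signal.

```python
def one_pole_lpf(values, a=0.999):
    """Smooth a sequence with y[n] = a * y[n-1] + (1 - a) * x[n]."""
    out, state = [], 0.0
    for v in values:
        state = a * state + (1.0 - a) * v
        out.append(state)
    return out

def normalized_mse(e, y, u, a=0.999):
    """Ratio of smoothed squared residual to smoothed squared echo."""
    num = one_pole_lpf([(ei - ui) ** 2 for ei, ui in zip(e, u)], a)
    den = one_pole_lpf([(yi - ui) ** 2 for yi, ui in zip(y, u)], a)
    return [n / d if d > 0.0 else 0.0 for n, d in zip(num, den)]

# Residual at one tenth of the echo amplitude gives a ratio of 0.01 (-20 dB).
mse = normalized_mse([0.1] * 100, [1.0] * 100, [0.0] * 100)
print(mse[-1])
```

Because numerator and denominator pass through the same smoother, the start-up transient of the filter cancels in the ratio.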



From these curves, it can be concluded that the system is robust to double-talk,
yet at the same time shows rapid convergence after echo path changes because of
proper behavior of the double-talk detector in the different situations. Figure 7.6
shows the MSE after an echo path change (same as in Fig. 7.5) with the residual
echo suppressor active. For the gain change, Fig. 7.6(b), there is a slight increase
in returned residual echo.
Figure 7.5 Effects of echo path changes in terms of mean-square error of the two-channel robust
frequency-domain algorithm with the pseudo-coherence-based DTD (no double-talk present). 
The echo paths in the receiving room change as: (a) Time delay of 5 samples at 2 s. (b) 6-dB 
increase of echo path gain at 2 s. The rectangles in the bottom of the figures indicate where 
double-talk has been detected (in this case, they are false alarms). 







Figure 7.6 Data and conditions as in Fig. 7.5 except the residual echo suppressor is active. The 
echo paths in the receiving room change as: (a) Time delay of 5 samples at 2 s. (b) 6-dB increase 
of echo path gain at 2 s. 
6. REAL-TIME TESTS WITH DIFFERENT MODES
OF OPERATION

The authors have tested and evaluated the system in various environments 
and operation modes. Point-to-point communication in full stereo or mono has 
been made in an office environment, using a desktop or a laptop computer, and 
in a larger conference room, where the loudspeakers and microphones are far 
apart. Multi-point communication has been performed using a mix of laptop 
and desktop machines. 

6.1 POINT-TO-POINT COMMUNICATION 

In point-to-point communication, two clients (PCs) are directly connected to 
each other through a network. Tests with typical office environments showed 
that a sufficient impulse response length is 64 ms for achieving good echo 
cancellation. It is also possible to use the shorter span of 32 ms, but the com- 
munications quality decreases somewhat because the suppressor must act more 
aggressively to reduce residual echo. 

In a conference-room situation, the audio components are physically dis- 
tributed, hence the distance between the microphone and loudspeakers is in- 
creased. Furthermore, the room is usually larger than a regular office. In this 
case, a 32 ms filter will result in poor echo cancellation, especially in stereo 
mode where the tail effects become severe [12]. The minimum length of the 
estimated filter that worked well was 64 ms, but an audible improvement was
achieved with 128 ms. 

What characterizes a laptop audio system is the close arrangement of the 
microphone and the loudspeakers. Moreover, the loudspeakers are often of 
poor quality. There are also significant levels of noise originating from computer 
components (e.g., hard drive) located close to the microphone. In this situation, 
it is not possible to improve echo cancellation by simply increasing the impulse 
response length. A good choice is a short filter of length 32 ms. Other serious 
problems include the nonlinearity of the loudspeaker and the acoustic resonance
of the laptop case, neither of which can be properly modeled with a linear FIR
filter of finite length. Also, the keyboard is often between the microphone and
the loudspeaker and, if it is used, the impulse response rapidly changes all the 
time. To achieve good performance in this environment, accurate suppression 
is required. 

6.2 MULTI-POINT COMMUNICATION 

The conference server, also designed by the authors, is an audio bridge
that connects many clients and creates synthesized stereo or 3D-audio streams.
Each user that connects to the server will hear all other clients connected to the
same server. Because returning echoes are summed at the server, this situation
places tougher requirements on echo cancellation. Furthermore, because of
the nonuniqueness problem, the most difficult situation for the echo canceler
to handle is the synthetic stereo/3D-audio case. Therefore, all clients have to
adjust the cancellation parameters (i.e. impulse response length, suppression
parameters etc.) and the audio system very carefully. With synthesized stereo
and four clients, the positions of the different speakers are well distinguished.
With 3D-audio, the audio distribution can be improved.

Figure 7.7 Recorded speech from first transatlantic teleconference in stereo. (a) Right channel,
(b) Left channel. In order from top to bottom: transmission speech (upper), echo and receiving
speech (middle), and processed echo and receiving speech (lower).

6.3 TRANSATLANTIC TELECONFERENCE IN 
STEREO 

The WinEC system has been successfully used for transatlantic point-to-point
communication over commercial networks. We performed the first stereo
conference on January 31, 2003 between Chatham, New Jersey, USA and Darmstadt,
Germany. In this test we used mono and stereo mode and audio in both
coded (32/64 kbit/s) and uncoded (256/512 kbit/s) formats. Audio quality was
outstanding (i.e. significantly better than regular phone toll quality) and very
few network problems were experienced. The spatial properties of the audio
really create an increased sense of presence. A clip of the transmission speech,
echo and receiving speech, and return speech is shown in Fig. 7.7. The computer
screen with our WinEC application, Microsoft’s Netmeeting, and Internet
Explorer is shown in Fig. 7.8. We believe that this experiment represents the
first ever transatlantic PC-based hands-free full duplex stereo conference over
commercial IP networks. Our experience from this and subsequent evaluation
sessions is that this technology is ready for commercialization.

Figure 7.8 Computer screen with the WinEC application, Netmeeting (only used to show video
of the participants, V. Fischer and T. Gänsler; Netmeeting audio has been disabled), and Internet
Explorer running concurrently. While the latter participant is communicating hands-free with
V. Fischer, he is also talking over a PSTN line with G. Elko, hence the handset.

7. DISCUSSION 

In this chapter, we have described an implementation of a flexible stereo- 
phonic acoustic echo canceler. This implementation runs natively under Mi- 
crosoft Windows on a PC. The major obstacle with such a scheme is the syn- 
chronization problem of the input and output audio streams of the sound card. 
Without proper synchronization, good cancellation cannot be maintained. 

Evaluation of the echo canceler has shown that it achieves the theoretical 
bounds on performance (echo attenuation) which in general is approximately 
5 dB below the room noise level (algorithm parameter dependent). This performance
is valid for an echo-to-noise ratio down to about 35 dB. In practice,
one cannot expect more cancellation because of the linear model mismatch,
non-stationary room responses, and unmodeled tails of the responses. Attenu- 
ation of 20 to 35 dB is not sufficient since the round-trip delay of the system is 
fairly large, about 350 ms. This delay is mainly due to delay in the sound card 
and network interface and is a function of the involved buffers’ lengths (assum- 
ing insignificant network delay). Because of this, residual echo suppression is 
required and has been implemented. 

The application supports mono, natural full stereo, and synthetic stereo 
hands-free communication. Multi-point communication can be done in mono 
or synthetic stereo/3D-audio mode. 

Notes 

1. In a real-life situation, we need an echo canceler for the “transmission room” as well. However, for
simplicity, we chose to exclude it in the figure.

2. The analysis is also valid for the time-domain function of order zero (scalar). 
Abstract Time delay estimation has been a research topic of significant practical importance
in many fields (radar, sonar, seismology, geophysics, ultrasonics, hands-free communications,
etc.). It is a first stage that feeds into subsequent processing blocks
for identifying, localizing, and tracking radiating sources. This area has made 
considerable advances in the past few decades, and is continuing to progress, 
with an aim to create processors that are tolerant to both noise and reverberation. 
This chapter reviews some recently developed algorithms for time delay estima- 
tion. The emphasis is placed on their performance analysis and comparison in 
reverberant environments. In particular, algorithms reviewed include the gener- 
alized cross-correlation algorithm, the multichannel cross-correlation algorithm, 
and the blind channel identification technique based algorithms. Furthermore, 
their relations and improvements are also discussed. Experiments based on the 
data recorded in the Varechoic chamber at Bell Labs are provided to illustrate 
their performance differences. 

Keywords: Time Delay Estimation, Time of Arrival Estimation, Time Difference of Arrival
Estimation, Multipath Effect, Reverberation, Blind Channel Identification,
Coherence Function, Cross-Correlation Function, Eigenvalue Decomposition, 
Multichannel, Adaptive Algorithms, LMS, Newton’s Method 
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1. INTRODUCTION 

Time delay estimation (TDE) has been an area of great research interest 
for many decades. It has plenty of applications in fields as diverse as radar, 
sonar, seismology, geophysics, nltrasonics, and communications for detecting, 
identifying, and localizing radiating sonrces. 

Depending on the nature of its application, TDE can be dichotomized into
two broad categories, namely, the time of arrival (TOA) estimation [1, 2, 3, 4]
and the time difference of arrival (TDOA) estimation [5, 6]. The former aims
at measuring the time delay between the transmission of a pulse signal and the
reception of its echo, which is often of primary interest to an active system such
as radar and active sonar; while the latter, as its name indicates, endeavors to
determine the relative time difference of arrival between two spatially separated
sensors, which is often of concern to a passive system such as passive sonars and
microphone array systems. Though there exists an intrinsic relationship between
TOA and TDOA estimation, their essential difference is profound.
In the former case, the “clean” reference signal, i.e., the transmitted signal, is
known, such that the time delay estimate can be obtained based on a single
sensor, generally using the matched filter approach. On the contrary, in the
latter, no such explicit reference signal is available, and the delay estimate is
often acquired by comparing the signals received at two (or more) spatially
separated sensors. This chapter deals with the subject of time delay estimation,
with emphasis on the time difference of arrival. From now on, we will make
no distinction between TDE and TDOA estimation unless necessary.

Measuring TDOA among different channels is a fundamental approach to
detecting, localizing, and tracking radiating sources. Over the past few decades,
researchers have approached this challenging problem by exploiting different
facets of the received signals. Some good reviews of such efforts can be found
in [5, 7, 8, 9]. Fundamentally, the solutions to the problem can be categorized
from the following points of view:

■ The number of sources in the field, i.e., single-source TDE techniques
[5] and multiple-source TDE techniques [10, 11].

■ How the propagation condition is modeled, i.e., the ideal single-path
propagation model [5], the multipath propagation model [12, 13, 14],
and the convolutive reverberant model [15].

■ What analysis tools are employed, e.g., generalized cross-correlation
method [5, 16, 17, 18, 19, 20], higher-order statistics (HOS) based
approaches [21, 22], and blind channel identification based algorithms
[15, 23].

■ How the delay estimate is updated, i.e., non-adaptive and adaptive [24,
25, 26, 27, 28] approaches.

The vast majority of existing TDE algorithms deal with a single source.
While estimating the time delay of multiple targets is an important issue, this
work focuses on the single-source scenario only. The normal practice in such
a task involves identifying the maximum of the generalized cross-correlation
(GCC) function of the outputs of two sensors. This so-called GCC method
consists of two prefilters followed by a cross correlator. The prefilters operate on the
observation sequences in the frequency domain to control the TDE performance
and their transfer functions are chosen according to some criterion. Some
prefilters are optimal in the sense that the estimation variance can approach the
Cramér-Rao lower bound [such as the maximum likelihood (ML) approach].
Some are sub-optimal but possess special properties, for example, the ability
to deal more efficiently with noise. The GCC approach is well studied and can
produce reasonable performance in the single-path propagation situation.

Recently, there has been a growing interest in exploring the TDE technique 
in a room acoustic environment where the time delay estimation becomes more 
complicated owing to the sophisticated multipath effect: in addition to the
direct path, the source wavefront reaches the receiver after getting reflected off 
room boundaries such as walls and floors, and other objects in the room. This 
multipath effect introduces echoes and spectral distortion into the speech signal, 
which is termed reverberation. The GCC algorithm, and many other traditional 
methods tend to break down in such reverberant environments [29]. 

Much attention has been paid to combatting reverberation lately. Most of 
such efforts fall into two categories. The first is to use multiple (more than two) 
sensors and take advantage of the redundancy to improve the TDE performance. 
Two examples are Ferguson’s method [30] and the consistency-based method
[31]. Ferguson’s approach is an extension of the GCC algorithm in which an
array of sensors is divided into two subarrays and the delay estimate is
obtained by cross-correlating the outputs of the two beamformers. Evidently,
this method needs the direction of arrival (DOA) or TDOA as a priori
information for subarray beamforming. Unfortunately, such information is
rarely available in practice and is hard to acquire. In [31], Griebel and Brandstein offered a
consistency-based method where multiple sensors are partitioned into several 
pairs, and cross-correlation functions from different sensor pairs are fused to- 
gether in the final cost function to obtain the time delay. This method requires 
the sensors to be paired in such a way that each correlation function should 
have a peak due to the same source in the same location. We have recently 
proposed a TDE algorithm based on the spatial interpolation technique, which 
will be detailed later. This method is shown to be a natural generalization of the 
cross-correlation method. It can improve the TDE performance as the number 
of sensors increases, without any a priori knowledge. 

The second effort to deal with reverberation is to remodel the observation
signals. In [15], a convolutive model is proposed to describe the TDE problem.
Different from the traditional way, this new model takes into account not only
the direct path, but all the reflections as well, whereby the received signal of
each sensor is modeled as the convolution of the source signal with the channel
impulse response from the source to the sensor. The TDE problem then
reduces to identifying the channel impulse responses from the source to the
sensors. Also in [15], an adaptive eigenvalue decomposition based algorithm
is introduced to estimate two channel impulse responses (blindly), and thus the
TDOA between the two channels. This method yields a robust solution to the
TDE problem in a reverberant environment when the two channels do not share
common zeros. This method is further enhanced by a multichannel technique
presented in [32].

The objective of this chapter is to review some recent developments made 
in the TDE research. The focus is placed on their performance analysis and 
comparison in reverberant environments. In particular, this chapter reviews the 
generalized cross-correlation algorithm, the multichannel cross-correlation al- 
gorithm, and the blind-channel identification technique based algorithms along 
with their relations and improvements. 

2. SIGNAL MODELS 

Three models have been employed to describe an acoustic environment in the 
TDE problem: the ideal single-path propagation model, the multipath model, 
and the reverberant model. 

2.1 IDEAL PROPAGATION MODEL 

This model assumes that the signal acquired by each sensor is a delayed
and attenuated version of the original source signal plus some additive noise.
Suppose that we have an array consisting of $L + 1$ receivers; the received signals
can be expressed as:

$$x_l[n] = \alpha_l s[n - t - f_l(\tau)] + w_l[n], \quad l = 0, 1, \ldots, L, \qquad (8.1)$$

where $\alpha_l$, $l = 0, 1, \ldots, L$, are the attenuation factors due to propagation
effects, $t$ is the propagation time from the unknown source $s[n]$ to Sensor
0, $w_l[n]$ is an additive noise signal at the $l$th microphone, $\tau$ is the relative
delay between Microphones 0 and 1, and $f_l(\tau)$ is the relative delay between
Microphones 0 and $l$. The function $f_l$ depends not only on $\tau$ but also on the
microphone array geometry. For example, in the far-field case (plane wave
propagation), for a linear and equispaced array, we have:

$$f_l(\tau) = l\tau, \qquad (8.2)$$

and for a linear but non-equispaced array, we have:

$$f_l(\tau) = \frac{\sum_{i=0}^{l-1} d_i}{d_0}\,\tau, \qquad (8.3)$$

where $d_i$ is the distance between Microphones $i$ and $i + 1$, $i = 0, 1, \ldots, L - 1$.

In the near-field case, $f_l$ depends also on the position of the source. Also note
that $f_l(\tau)$ can be a nonlinear function of $\tau$ for a nonlinear array geometry, even
in the far-field case (e.g., 3 equilateral sensors). In general $\tau$ is not known, but
the geometry of the array is known such that the mathematical formulation of
$f_l(\tau)$ is well defined or given. It is further assumed that $w_l[n]$ is a zero-mean
Gaussian random process that is uncorrelated with $s[n]$ and the noise signals
at other sensors. For this model, the TDE problem is formulated to determine
an estimate $\hat{\tau}$ of the true time delay $\tau$ using a finite set of observation samples.

2.2 MULTIPATH MODEL 

The ideal propagation model takes into account the direct path signal only. In 
many situations, however, each sensor receives multiple delayed and attenuated 
replicas of the source signal due to reflections of the source wavefront from 
boundaries and objects in addition to the direct path signal. This so-called 
multipath effect has been intensively studied in the literature [13, 14, 33, 34]. 
In this case, the received signals are often described mathematically as:

$$x_l[n] = \sum_{m=1}^{M} \alpha_{lm}\, s[n - t - \tau_{lm}] + w_l[n], \quad l = 0, 1, \ldots, L, \qquad (8.4)$$

where $\alpha_{lm}$ is the attenuation factor from the unknown source to the $l$th sensor
via the $m$th path, $t$ is the propagation time from the source to Sensor 0 via the direct
path, $\tau_{lm}$ is the relative delay between Sensor $l$ and Sensor 0 for path $m$ with
$\tau_{01} = 0$, $M$ is the number of different paths, and $w_l[n]$ is stationary Gaussian
noise and assumed to be uncorrelated with both the source signal and the noise
signals received by other sensors. This model is widely adopted in the oceanic
propagation environments as illustrated in Fig. 8.1, where each sensor receives
the direct path signal, as well as reflections from both the sea surface and the
sea bottom [35, 36]. The primary interest of the TDE problem for this model is
to measure $\tau_{l1}$, $l = 1, \ldots, L$, which is the TDOA between Sensor $l$ and Sensor
0 via the direct path.

Figure 8.1 Illustration of the signal model in a multipath environment.



2.3 REVERBERANT MODEL 

The multipath model is valid for some but not all environments [37]. In
addition, if there are many different paths, i.e., $M$ is large, it is difficult to
estimate all $\tau_{lm}$’s in (8.4). Recently, a more realistic convolutive model has
been introduced to describe the TDE problem in a room environment where 
each sensor often receives a large number of echoes due to the reflections of 
objects and room boundaries such as walls, ceiling, and floor [15]. In addition, 
reflections can occur several times before a signal reaches the array, as shown 
in Fig. 8.2. In this model, the received signals are expressed as: 

$$x_l[n] = h_l * s[n] + w_l[n], \qquad (8.5)$$

where $*$ denotes convolution, $h_l$ is the channel impulse response between the
source and the $l$th sensor, and again $w_l[n]$ is a noise signal.

As seen, no time delay is explicitly expressed in (8.5), hence there is no plain 
solution to the TDE problem for the reverberation model, unless the channel 
impulse responses can be accurately (and blindly) identified, which is a very 
challenging problem. 

3. GENERALIZED CROSS-CORRELATION 
METHOD 

The generalized cross-correlation (GCC) algorithm is the most widely used
approach for TDE. It is based on the ideal propagation model with two
sensors, i.e., (8.1) with $L = 1$. In this framework, the delay estimate is obtained
as

$$\hat{\tau}_{\mathrm{GCC}} = \arg\max_{m} \Psi_{\mathrm{GCC}}[m], \qquad (8.6)$$

where

$$\Psi_{\mathrm{GCC}}[m] = \sum_{k=0}^{N-1} \Phi[k]\, S_{x_0 x_1}[k]\, e^{j 2\pi m k / N}$$

is the generalized cross-correlation function, $S_{x_0 x_1}[k] = E\{X_0[k] X_1^*[k]\}$ is the
cross spectrum, $E\{\cdot\}$ and $(\cdot)^*$ stand respectively for the mathematical expectation
and the complex conjugate operator, $X_l[k]$ is the discrete Fourier transform
(DFT) of $x_l[n]$, $\Phi[k]$ is a weighting function (sometimes called a prefilter), and
$N$ denotes the number of observation samples during the observation interval.

Figure 8.2 Illustration of the signal model in a reverberant environment.

The weighting function plays an important role in controlling the TDE
performance. It is chosen according to some criterion. Commonly used weighting
functions include unit weighting (the classical cross-correlation method),
the smoothed coherence transform (SCOT) [38], the Roth processor [39], the
Eckart filter, the phase transform (PHAT), the maximum likelihood (ML)
processor [5], the Hassab-Boucher transform [16], etc. Some of these are optimal in
the sense that the estimation variance can achieve the Cramér-Rao lower bound
(CRLB). Others are suboptimal but possess special properties, as for example
the PHAT algorithm where $\Phi_{\mathrm{PHAT}}[k] = 1/|S_{x_0 x_1}[k]|$. Substituting $\Phi_{\mathrm{PHAT}}[k]$ into
(8.6) and neglecting noise effects, one can readily deduce that the weighted
cross spectrum is free from the source signal and depends only on the channel
responses. Consequently the PHAT algorithm performs more consistently than
many other GCC members when the characteristics of the source signal change.
This makes the PHAT algorithm superior in many applications.
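
As a rough illustration of the GCC estimator with the PHAT weighting, the following NumPy sketch estimates an integer-sample delay between two signals. It is only a sketch under several assumptions that are not in the text: the function name `gcc_phat` is hypothetical, the cross spectrum is estimated from a single frame (no averaging, so the expectation is dropped), a small constant guards against division by zero, and the conjugate is placed on $X_0$ so that a positive result means $x_1$ lags $x_0$ (sign conventions differ between references).

```python
import numpy as np

def gcc_phat(x0, x1, max_lag):
    """Integer-sample delay of x1 relative to x0 using PHAT weighting.

    Hypothetical helper: single-frame cross-spectrum estimate, no averaging.
    """
    n_fft = len(x0) + len(x1)               # zero-pad so circular lags act like linear lags
    X0 = np.fft.rfft(x0, n_fft)
    X1 = np.fft.rfft(x1, n_fft)
    S = np.conj(X0) * X1                    # cross spectrum; sign convention is an assumption
    S = S / (np.abs(S) + 1e-12)             # PHAT: keep the phase, discard the magnitude
    psi = np.fft.irfft(S, n_fft)            # generalized cross-correlation over all lags
    lags = np.arange(-max_lag, max_lag + 1)
    # Negative lags are stored at the end of the inverse FFT output.
    psi_win = np.concatenate((psi[-max_lag:], psi[:max_lag + 1]))
    return int(lags[np.argmax(psi_win)])

# Synthetic single-path example: x1 is x0's source delayed by 7 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
true_delay = 7
x0 = s + 0.1 * rng.standard_normal(s.size)
x1 = np.concatenate((np.zeros(true_delay), s[:-true_delay]))
x1 = x1 + 0.1 * rng.standard_normal(s.size)
delay = gcc_phat(x0, x1, max_lag=32)
```

Restricting the search to `max_lag` reflects the fact that the physically possible delay is bounded by the sensor spacing; in a reverberant room a practical implementation would also average the cross spectrum over several frames.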

GCC is a computationally efficient algorithm and is simple to implement. It
performs well in the single-path propagation scenario when the signal-to-noise
ratio (SNR) is high [5, 7, 16]. However, its performance degrades significantly
when SNR drops below a certain threshold. This so-called threshold effect
is also observed in a reverberant environment where the GCC method suffers
sudden performance degradation when the reverberation time increases up to
around 0.15 s [29]. In addition, most weighting functions in the GCC family
are dependent on the signal and noise spectra and, in the absence of this prior
information, the spectra can only be estimated. Therefore even though certain
weighting functions are known to be optimal in theory, they can only be
approximated in practice.

4. THE MULTICHANNEL CROSS-CORRELATION 
ALGORITHM 

The fact that the GCC method does not perform well in reverberant envi- 
ronments motivated research to develop new algorithms. Two strategies were 
adopted. One is to blindly estimate the impulse responses from the source to 
various sensors, which will be discussed later. The other is to exploit the redun- 
dancy among different sensor signals in an array for robustness against noise 
and reverberation and will be developed here. 

4.1 SPATIAL PREDICTION TECHNIQUE 

As seen from the ideal propagation model given in (8.1), the signal of one 
sensor is not completely independent from the other sensor signals. The spa- 
tial prediction technique captures the dependence among signals and makes a 
prediction of a data sample of one sensor using samples from L other sensors. 
This concept was presented in [40] for the simple case in which the spatial 
prediction is made equivalent to the classical linear prediction. In this section, 
we generalize this idea in a way that the geometry of the array is taken into ac- 
count as well as the relative delay among the elements. As a result, the spatial 
correlation matrix has a much more general form. 

4.1.1 Linear Forward Spatial Prediction. We would like to align successive
time samples of the Microphone 0 signal with spatial samples from the $L$
other microphone signals. It is clear that $x_0[n - f_L(\tau)]$ is in phase with the
signals $x_l[n - f_L(\tau) + f_l(\tau)]$, $l = 1, 2, \ldots, L$. From these observations, we
define the following forward spatial prediction error signal:

$$e_0[n - f_L(m)] = x_0[n - f_L(m)] - \mathbf{x}_{1:L}^T[n - f_L(m)]\,\mathbf{a}_m, \qquad (8.7)$$

where $m$ is a guessed relative delay for $\tau$, $(\cdot)^T$ denotes transpose of a vector or
a matrix,

$$\mathbf{x}_{1:L}[n - f_L(m)] = \left[\, x_1[n - f_L(m) + f_1(m)] \;\; x_2[n - f_L(m) + f_2(m)] \;\; \cdots \;\; x_L[n] \,\right]^T,$$

and

$$\mathbf{a}_m = \left[\, a_{m,1} \;\; a_{m,2} \;\; \cdots \;\; a_{m,L} \,\right]^T$$

is the linear forward spatial predictor. Consider the criterion

$$J_{m,0} = E\{e_0^2[n - f_L(m)]\}. \qquad (8.8)$$

Minimization of (8.8) leads to the equation:

$$\mathbf{R}_{m,1:L}\,\mathbf{a}_m = \mathbf{r}_{m,1:L}, \qquad (8.9)$$

where

$$\mathbf{R}_{m,1:L} = E\{\mathbf{x}_{1:L}[n - f_L(m)]\,\mathbf{x}_{1:L}^T[n - f_L(m)]\}
= \begin{bmatrix}
E\{x_1^2[n]\} & R_{12}(m) & \cdots & R_{1L}(m) \\
R_{21}(m) & E\{x_2^2[n]\} & \cdots & R_{2L}(m) \\
\vdots & \vdots & \ddots & \vdots \\
R_{L1}(m) & R_{L2}(m) & \cdots & E\{x_L^2[n]\}
\end{bmatrix}$$

is the spatial correlation matrix with

$$R_{lk}(m) = E\{x_l[n - f_k(m)]\,x_k[n - f_l(m)]\},$$

and

$$\mathbf{r}_{m,1:L} = E\{\mathbf{x}_{1:L}[n - f_L(m)]\,x_0[n - f_L(m)]\}
= \begin{bmatrix}
E\{x_1[n - f_L(m) + f_1(m)]\,x_0[n - f_L(m)]\} \\
E\{x_2[n - f_L(m) + f_2(m)]\,x_0[n - f_L(m)]\} \\
\vdots \\
E\{x_L[n]\,x_0[n - f_L(m)]\}
\end{bmatrix}
= \begin{bmatrix}
E\{x_1[n]\,x_0[n - f_1(m)]\} \\
E\{x_2[n]\,x_0[n - f_2(m)]\} \\
\vdots \\
E\{x_L[n]\,x_0[n - f_L(m)]\}
\end{bmatrix}$$

is the spatial correlation vector. Note that the spatial correlation matrix is not
Toeplitz in general, except in some particular cases.

For $m = \tau$ and for the noise-free case (where $w_l[n] = 0$, $l = 0, 1, 2, \ldots, L$),
it can easily be checked that with the ideal propagation model, the rank of matrix
$\mathbf{R}_{\tau,1:L}$ is equal to 1. This means that the samples $x_0[n - f_L(\tau)]$ can be perfectly
predicted from any one of the other microphone samples. However, the noise
is never absent in practice and is in general isotropic. The noise energy at
different microphones is added to the main diagonal of the correlation matrix
$\mathbf{R}_{\tau,1:L}$, which will regularize it and make it positive definite (which we suppose
in the rest of this chapter). A unique solution to (8.9) is then guaranteed for any
number of microphones. This solution is optimal from a Wiener theory point
of view.

4.1.2 Linear Backward Spatial Prediction. We still assume the ideal
propagation model but this time we consider Microphone $L$ and we would like
to align successive time samples of this microphone signal with spatial samples
from the $L$ other microphone signals. It is clear that $x_L[n]$ is in phase with the
signals $x_l[n - f_L(\tau) + f_l(\tau)]$, $l = 0, 1, \ldots, L - 1$. From these observations,
we define the following backward spatial prediction error signal:

$$e_L[n - f_L(m)] = x_L[n] - \mathbf{x}_{0:L-1}^T[n - f_L(m)]\,\mathbf{b}_m, \qquad (8.10)$$

where

$$\mathbf{x}_{0:L-1}[n - f_L(m)] = \left[\, x_0[n - f_L(m) + f_0(m)] \;\; x_1[n - f_L(m) + f_1(m)] \;\; \cdots \;\; x_{L-1}[n - f_L(m) + f_{L-1}(m)] \,\right]^T$$

and

$$\mathbf{b}_m = \left[\, b_{m,0} \;\; b_{m,1} \;\; \cdots \;\; b_{m,L-1} \,\right]^T$$

is the linear backward spatial predictor. Minimization of the criterion

$$J_{m,L} = E\{e_L^2[n - f_L(m)]\} \qquad (8.11)$$

leads to the equation:

$$\mathbf{R}_{m,0:L-1}\,\mathbf{b}_m = \mathbf{r}_{m,0:L-1}, \qquad (8.12)$$

where

$$\mathbf{R}_{m,0:L-1} = E\{\mathbf{x}_{0:L-1}[n - f_L(m)]\,\mathbf{x}_{0:L-1}^T[n - f_L(m)]\}$$

and

$$\mathbf{r}_{m,0:L-1} = E\{\mathbf{x}_{0:L-1}[n - f_L(m)]\,x_L[n]\}.$$

4.1.3 Linear Spatial Interpolation. The ideas presented for spatial
prediction can easily be extended to spatial interpolation, where we consider any
microphone element $l$, $l = 0, 1, 2, \ldots, L$. The spatial interpolation error signal
is defined as

$$e_l[n - f_L(m)] = -\mathbf{x}_{0:L}^T[n - f_L(m)]\,\mathbf{c}_{m,l}, \qquad (8.13)$$

where

$$\mathbf{x}_{0:L}[n - f_L(m)] = \left[\, x_0[n - f_L(m) + f_0(m)] \;\; x_1[n - f_L(m) + f_1(m)] \;\; \cdots \;\; x_L[n] \,\right]^T$$

and

$$\mathbf{c}_{m,l} = \left[\, c_{m,l,0} \;\; c_{m,l,1} \;\; \cdots \;\; c_{m,l,L} \,\right]^T$$

is the spatial interpolator with $c_{m,l,l} = -1$. The criterion associated with (8.13)
is:

$$J_{m,l} = E\{e_l^2[n - f_L(m)]\}. \qquad (8.14)$$

The rest follows immediately from the previous sections on spatial prediction.

4.2 TIME DELAY ESTIMATION USING SPATIAL 
PREDICTION 

The spatial prediction, and more generally the spatial interpolation technique 
can be applied to the problem of TDE. Here, we consider the linear forward 
prediction only. The idea can be easily generalized to the linear backward 
prediction and spatial interpolation. 

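Let $J_{m,0;\min}$ denote the minimum mean-squared error, for the value $m$,
defined by

$$J_{m,0;\min} = E\{e_{0;\min}^2[n - f_L(m)]\}. \qquad (8.15)$$

If we replace $\mathbf{a}_m$ by $\mathbf{R}_{m,1:L}^{-1}\mathbf{r}_{m,1:L}$ in (8.7), we get:

$$e_{0;\min}[n - f_L(m)] = x_0[n - f_L(m)] - \mathbf{x}_{1:L}^T[n - f_L(m)]\,\mathbf{R}_{m,1:L}^{-1}\mathbf{r}_{m,1:L}. \qquad (8.16)$$

We deduce that:

$$J_{m,0;\min} = E\{x_0^2[n - f_L(m)]\} - \mathbf{r}_{m,1:L}^T\mathbf{R}_{m,1:L}^{-1}\mathbf{r}_{m,1:L}. \qquad (8.17)$$

The value of $m$ that gives the minimum $J_{m,0;\min}$, for different $m$, corresponds
to the time delay between Microphones 0 and 1. Mathematically, the solution
to the TDE problem is then given by

$$\hat{\tau} = \arg\min_{m} J_{m,0;\min}, \qquad (8.18)$$

where $\hat{\tau}$ is an estimate of $\tau$.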
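
To make the search over $m$ concrete, here is a minimal NumPy sketch of the forward-prediction estimator: for each candidate delay it aligns the channels, solves the normal equations, and evaluates the minimum prediction error. Several things are assumptions for illustration only: the helper names `aligned_block`, `min_forward_error`, and `estimate_delay` are hypothetical, the array is taken to be linear and equispaced in the far field so that $f_l(m) = lm$, delays are integer numbers of samples, expectations are replaced by sample averages, and the test signal is synthetic white noise.

```python
import numpy as np

def aligned_block(X, m, pad):
    """Rows are x_l[n - f_L(m) + f_l(m)], assuming f_l(m) = l * m."""
    L = X.shape[0] - 1
    n = np.arange(pad, X.shape[1] - pad)     # keep only indices valid for every shift
    return np.stack([X[l, n - (L - l) * m] for l in range(L + 1)])

def min_forward_error(X, m, pad):
    """Sample estimate of the minimum forward prediction error for a guessed delay m."""
    A = aligned_block(X, m, pad)
    R = A @ A.T / A.shape[1]                 # sample correlation of the aligned channels
    R_1L, r_1L = R[1:, 1:], R[1:, 0]         # spatial correlation matrix and vector
    a = np.linalg.solve(R_1L, r_1L)          # forward predictor from the normal equations
    return R[0, 0] - r_1L @ a                # power of x_0 minus the predicted part

def estimate_delay(X, max_m):
    pad = (X.shape[0] - 1) * max_m
    return min(range(-max_m, max_m + 1), key=lambda m: min_forward_error(X, m, pad))

# Synthetic far-field example: 4 microphones, true delay of 2 samples between neighbours.
rng = np.random.default_rng(1)
L, tau, N = 3, 2, 4000
s = rng.standard_normal(N + L * tau)
X = np.stack([s[(L - l) * tau:(L - l) * tau + N] for l in range(L + 1)])
X = X + 0.3 * rng.standard_normal(X.shape)   # independent sensor noise
tau_hat = estimate_delay(X, max_m=5)
```

Only one delay is searched even though four channels are used, because the array geometry ties all inter-sensor delays to $\tau$ through $f_l$.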

Particular case: Two microphones ($L = 1$). In this case, the solution is:

$$\begin{aligned}
\hat{\tau} &= \arg\min_{m} \left\{ E\{x_0^2[n - m]\} \left[ 1 - \frac{E^2\{x_0[n - m]\,x_1[n]\}}{E\{x_0^2[n - m]\}\,E\{x_1^2[n]\}} \right] \right\} \\
&= \arg\min_{m} \left\{ E\{x_0^2[n - m]\} \left[ 1 - \rho_{m,01}^2 \right] \right\} \\
&= \arg\min_{m} \left\{ 1 - \rho_{m,01}^2 \right\} \\
&= \arg\max_{m} \left( \rho_{m,01}^2 \right), \qquad (8.19)
\end{aligned}$$

where $\rho_{m,01}$ ($\rho_{m,01}^2 \le 1$) is the cross-correlation coefficient between $x_0[n - m]$
and $x_1[n]$. When the cross-correlation coefficient is close to 1, this means that
the two signals that we compare are highly correlated, which happens when
the signals are in phase, i.e., $m \approx \tau$, and this implies that $J_{\tau,0;\min} \approx 0$. This
approach is similar to the GCC method. Note that in the general case with
any number of microphones, the proposed approach can be seen as a
cross-correlation method. However, we take advantage of the knowledge of the
microphone array to estimate only one time delay (instead of estimating multiple
time delays independently) in an optimal way in a least mean square sense.

4.3 OTHER INFORMATION FROM THE SPATIAL 
CORRELATION MATRIX 

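Consider the $L + 1$ microphone signals $x_l$, $l = 0, 1, \ldots, L$. The corresponding
spatial correlation matrix is

$$\mathbf{R}_m = \mathbf{R}_{m,0:L} = E\{\mathbf{x}_{0:L}[n - f_L(m)]\,\mathbf{x}_{0:L}^T[n - f_L(m)]\}, \qquad (8.20)$$

which can be factored as:

$$\mathbf{R}_m = \mathbf{D}\,\tilde{\mathbf{R}}_m\,\mathbf{D}, \qquad (8.21)$$

where

$$\mathbf{D} = \begin{bmatrix}
\sqrt{E\{x_0^2[n]\}} & 0 & \cdots & 0 \\
0 & \sqrt{E\{x_1^2[n]\}} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sqrt{E\{x_L^2[n]\}}
\end{bmatrix} \qquad (8.22)$$

is a diagonal matrix,

$$\tilde{\mathbf{R}}_m = \begin{bmatrix}
1 & \rho_{m,01} & \cdots & \rho_{m,0L} \\
\rho_{m,01} & 1 & \cdots & \rho_{m,1L} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{m,0L} & \cdots & \rho_{m,L-1L} & 1
\end{bmatrix} \qquad (8.23)$$

is a symmetric matrix, and

$$\rho_{m,kl} = \frac{E\{x_k[n - f_l(m)]\,x_l[n - f_k(m)]\}}{\sqrt{E\{x_k^2[n]\}\,E\{x_l^2[n]\}}}, \quad k, l = 0, 1, \ldots, L, \qquad (8.24)$$

is the cross-correlation coefficient between $x_k[n - f_l(m)]$ and $x_l[n - f_k(m)]$.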
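We now give two propositions that will be useful for TDE.

Proposition 1. We have

$$0 < \det(\tilde{\mathbf{R}}_m) \le 1. \qquad (8.25)$$

Proof. Since $\mathbf{R}_m$ is symmetric and is supposed to be positive definite, it is clear
that $\det(\mathbf{R}_m) > 0$, which implies that $\det(\tilde{\mathbf{R}}_m) > 0$. To show that $\det(\tilde{\mathbf{R}}_m) \le 1$,
we can use the Cholesky factorization [41]. Since $\tilde{\mathbf{R}}_m$ is symmetric and
positive definite, there exists a unique lower triangular matrix $\mathbf{Q}_m$ with positive
diagonal entries such that $\tilde{\mathbf{R}}_m = \mathbf{Q}_m\mathbf{Q}_m^T$, where

$$\mathbf{Q}_m = \begin{bmatrix}
q_{m,00} & 0 & \cdots & 0 \\
q_{m,10} & q_{m,11} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
q_{m,L0} & \cdots & q_{m,LL-1} & q_{m,LL}
\end{bmatrix}. \qquad (8.26)$$

It can be shown that the elements of the main diagonal of matrix $\mathbf{Q}_m$ can be
computed as follows:

$$q_{m,ll} = \sqrt{1 - \sum_{k=0}^{l-1} q_{m,lk}^2}, \quad l = 0, 1, \ldots, L. \qquad (8.27)$$

It follows immediately from (8.27) that $0 < q_{m,ll} \le 1$, $\forall l$. Furthermore, since
$\mathbf{Q}_m$ is a triangular matrix, we have:

$$\det(\tilde{\mathbf{R}}_m) = \prod_{l=0}^{L} q_{m,ll}^2 \le 1,$$

which completes the proof.

Another way to show this proposition is by induction, i.e.,

$$\det(\tilde{\mathbf{R}}_m) = \det(\tilde{\mathbf{R}}_{m,0:L}) \le \det(\tilde{\mathbf{R}}_{m,1:L}) \le \cdots \le 1. \qquad (8.28)$$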
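
The bound on the determinant and the Cholesky argument can be checked numerically. The sketch below is only an illustration on made-up data: the four correlated signals are synthetic, `np.corrcoef` stands in for the matrix of cross-correlation coefficients (it also removes sample means, which is harmless for zero-mean signals), and the variable names are hypothetical.

```python
import numpy as np

# Four synthetic, partially correlated zero-mean signals (illustrative data only).
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2000))
X[1] += 0.8 * X[0]
X[2] += 0.5 * X[1]
X[3] += 0.3 * X[0]

R_tilde = np.corrcoef(X)                 # unit diagonal, correlation coefficients elsewhere
Q = np.linalg.cholesky(R_tilde)          # lower triangular factor, R_tilde = Q @ Q.T
q_diag = np.diag(Q)

det_R = np.linalg.det(R_tilde)
det_from_chol = np.prod(q_diag ** 2)     # determinant as the product of squared diagonal entries

# Determinants of the nested trailing submatrices (channels k..L for k = 0, 1, ..., L).
nested_dets = [np.linalg.det(R_tilde[k:, k:]) for k in range(R_tilde.shape[0])]
```

Each diagonal entry of the factor lies in $(0, 1]$ because every row of $\mathbf{Q}_m$ has unit norm, and the nested determinants are non-decreasing as channels are dropped, ending at 1 for a single channel.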



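Proposition 2. The determinant of a cross-correlation coefficient matrix is
bounded by

$$\det(\tilde{\mathbf{R}}_m) \le \frac{J_{m,0;\min}}{E\{x_0^2[n]\}} \le 1. \qquad (8.29)$$

Proof. First define

$$\mathbf{a}_{m,0:L} = \left[\, a_{m,0} \;\; \mathbf{a}_m^T \,\right]^T. \qquad (8.30)$$

Then, for $a_{m,0} = -1$, the forward prediction error signal defined in (8.7) can
be rewritten as

$$e_0[n - f_L(m)] = -\mathbf{x}_{0:L}^T[n - f_L(m)]\,\mathbf{a}_{m,0:L}. \qquad (8.31)$$

Continuing, the criterion shown in (8.8) can be expressed as

$$J_{m,0} = E\{e_0^2[n - f_L(m)]\} + \lambda\,(\boldsymbol{\delta}^T\mathbf{a}_{m,0:L} + 1), \qquad (8.32)$$

where $\boldsymbol{\delta} = [\,1 \;\; 0 \;\; \cdots \;\; 0\,]^T$, and $\lambda$ is a Lagrange multiplier introduced to force
$a_{m,0}$ to have value $-1$. It is then easily shown that:

$$J_{m,0;\min} = \frac{1}{\boldsymbol{\delta}^T\mathbf{R}_m^{-1}\boldsymbol{\delta}}. \qquad (8.33)$$

In this case, with (8.21), (8.33) becomes:

$$J_{m,0;\min} = \frac{E\{x_0^2[n]\}}{\boldsymbol{\delta}^T\tilde{\mathbf{R}}_m^{-1}\boldsymbol{\delta}}
= E\{x_0^2[n]\}\,\frac{\det(\tilde{\mathbf{R}}_m)}{\det(\tilde{\mathbf{R}}_{m,1:L})}. \qquad (8.34)$$

Using (8.28), it is clear that Proposition 2 is verified.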

In the general case, for any interpolator, we have:

$$\det(\tilde{\mathbf{R}}_m) \le \frac{J_{m,l;\min}}{E\{x_l^2[n]\}} \le 1, \quad l = 0, 1, \ldots, L. \qquad (8.35)$$

As seen, the determinant of the spatial correlation matrix is related to the
minimum mean-squared error and to the power of the signals. Let’s take the
two-channel case. It is obvious that the cross-correlation coefficient between
the two signals $x_0$ and $x_1$ is linked to the determinant of the corresponding
spatial correlation matrix:

$$\rho_{m,01}^2 = 1 - \det(\tilde{\mathbf{R}}_{m,0:1}). \qquad (8.36)$$

By analogy to the cross-correlation coefficient definition between two random
signals, we define the multichannel correlation coefficient among the signals
$x_l$, $l = 0, 1, \ldots, L$, as:

$$\rho_{m,0:L}^2 = 1 - \det(\tilde{\mathbf{R}}_{m,0:L}). \qquad (8.37)$$

From Proposition 2, we give a new bound for $\rho_{m,0:L}^2$:

$$1 - \frac{J_{m,0;\min}}{E\{x_0^2[n]\}} \le \rho_{m,0:L}^2 \le 1. \qquad (8.38)$$

Basically, the coefficient $\rho_{m,0:L}^2$ will measure the amount of correlation
among all the channels. This coefficient has some interesting properties. For
example, if one of the signals, say $x_0$, is completely decorrelated from the others
because the microphone is defective, or it picks up only noise, or the signal
is saturated, this signal will not affect $\rho_{m,0:L}^2$ since $\rho_{m,0l} = 0$, $\forall l$. In this case:

$$\rho_{m,0:L}^2 = \rho_{m,1:L}^2. \qquad (8.39)$$

In other words, the measure “drops” the signals that have no correlation with
the others. This makes sense from a correlation point of view, since we want to
measure the degree of correlation only from the channels that have something
in common. In the extreme cases where all the signals are uncorrelated, we
have $\rho_{m,0:L}^2 = 0$, and where any two signals (or more) are perfectly correlated,
we have $\rho_{m,0:L}^2 = 1$.

Obviously, the multichannel coefficient $\rho_{m,0:L}^2$ can be used for time delay
estimation in the following way:

$$\hat{\tau} = \arg\max_{m} \left( \rho_{m,0:L}^2 \right) = \arg\min_{m} \det\left( \tilde{\mathbf{R}}_{m,0:L} \right). \qquad (8.40)$$

This method can be seen as a multichannel correlation approach to the estimation
of time delay and it is clear that (8.40) is equivalent to (8.18).
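
The determinant-based estimator in (8.40) can be sketched in a few lines of NumPy. As with any such illustration, the details are assumptions rather than the authors' implementation: the helper names `aligned_block`, `mccc`, and `estimate_delay_mccc` are hypothetical, the array is assumed linear and equispaced in the far field ($f_l(m) = lm$), delays are integers, `np.corrcoef` replaces the expectations with sample statistics, and the data are synthetic.

```python
import numpy as np

def aligned_block(X, m, pad):
    """Rows are x_l[n - f_L(m) + f_l(m)], assuming f_l(m) = l * m."""
    L = X.shape[0] - 1
    n = np.arange(pad, X.shape[1] - pad)
    return np.stack([X[l, n - (L - l) * m] for l in range(L + 1)])

def mccc(X, m, pad):
    """Squared multichannel correlation coefficient for a guessed delay m."""
    R_tilde = np.corrcoef(aligned_block(X, m, pad))   # sample correlation-coefficient matrix
    return 1.0 - np.linalg.det(R_tilde)

def estimate_delay_mccc(X, max_m):
    pad = (X.shape[0] - 1) * max_m
    return max(range(-max_m, max_m + 1), key=lambda m: mccc(X, m, pad))

# Synthetic far-field example: 4 microphones, 2-sample delay between neighbours.
rng = np.random.default_rng(1)
L, tau, N = 3, 2, 4000
s = rng.standard_normal(N + L * tau)
X = np.stack([s[(L - l) * tau:(L - l) * tau + N] for l in range(L + 1)])
X = X + 0.3 * rng.standard_normal(X.shape)
tau_hat = estimate_delay_mccc(X, max_m=5)
rho2_at_peak = mccc(X, tau_hat, pad=L * 5)
```

Because the coefficient uses all channel pairs at once, adding microphones sharpens the peak at the true delay without requiring any prior knowledge of the source position beyond the array geometry.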



5. ADAPTIVE EIGENVALUE DECOMPOSITION 
ALGORITHM 

Although the TDE performance can be improved by employing multiple
sensors, the multichannel cross-correlation method is still based on the ideal 
propagation model which takes into account merely the direct paths. Starting 
from here, we intend to approach this problem from a different direction, con- 
sidering the more realistic reverberant model and using the blind multichannel 
identification technique. Again we will first focus on an array with only two 
channels and develop the adaptive eigenvalue decomposition algorithm. Then 
we proceed to generalize the idea to multichannel cases. 

With the reverberant model given in (8.5) with two sensors, if the noise term 
is neglected, one can readily derive the following relation: 

$$x_0[n] * h_1 = s[n] * h_0 * h_1 = x_1[n] * h_0. \qquad (8.41)$$

At time instant $n$, this relation can be rewritten in a vector-matrix form as [15]:

$$\mathbf{x}^T[n]\,\mathbf{u} = \mathbf{x}_0^T[n]\,\mathbf{h}_1 - \mathbf{x}_1^T[n]\,\mathbf{h}_0 = 0, \qquad (8.42)$$

where

$$\mathbf{x}_l[n] = \left[\, x_l[n] \;\; x_l[n-1] \;\; \cdots \;\; x_l[n - M + 1] \,\right]^T,$$

$$\mathbf{h}_l = \left[\, h_{l,0} \;\; h_{l,1} \;\; \cdots \;\; h_{l,M-1} \,\right]^T,$$

$$\mathbf{x}[n] = \left[\, \mathbf{x}_0^T[n] \;\; \mathbf{x}_1^T[n] \,\right]^T,$$

$$\mathbf{u} = \left[\, \mathbf{h}_1^T \;\; -\mathbf{h}_0^T \,\right]^T,$$

$l = 0, 1$, and $M$ is the length of the impulse responses. Left multiplying (8.42)
by $\mathbf{x}[n]$ and taking expectation yields

$$\mathbf{R}[n]\,\mathbf{u} = \mathbf{0}, \qquad (8.43)$$

where $\mathbf{R}[n] = E\{\mathbf{x}[n]\mathbf{x}^T[n]\}$ is the covariance matrix of the microphone
signals. This implies that vector $\mathbf{u}$, which consists of two impulse responses, is in
the null space of $\mathbf{R}[n]$. More specifically, $\mathbf{u}$ is the eigenvector of $\mathbf{R}[n]$
corresponding to the eigenvalue 0. This remarkable observation forms the basis of
the eigenvector based TDE algorithm [15, 42].

Before going further, we need to know whether (8.43) has a unique solution
(up to an arbitrary constant) other than the trivial solution. Indeed, it was shown
in [43] that this is the case if the following conditions hold:

■ The polynomials $H_0(z)$ and $H_1(z)$ are co-prime, or equivalently, they
do not share any common zeros, where $H_0(z)$ and $H_1(z)$ are the z-transforms
of $\mathbf{h}_0$ and $\mathbf{h}_1$, respectively.

■ The autocorrelation matrix of the source signal $s[n]$ is full rank.

When an independent white noise is present on each of the two microphones,
it will regularize the covariance matrix; as a consequence, $\mathbf{R}[n]$ does not have a
zero eigenvalue anymore. In such a case, the TDE problem can be formulated
as estimating $\mathbf{u}$ by minimizing $\mathbf{u}^T\mathbf{R}[n]\mathbf{u}$ subject to $\|\mathbf{u}\| = 1$. This is equivalent
to finding the normalized eigenvector associated with the lowest eigenvalue of
$\mathbf{R}[n]$ [15].

With the background that we have developed, we are now in a position to
treat the subject of efficiently estimating the eigenvector corresponding to the
smallest eigenvalue of $\mathbf{R}[n]$. In principle, any eigenvalue decomposition
algorithm can be used to solve the problem. Here we choose the constrained LMS
adaptive algorithm [15] for its simplicity, efficiency, and ability to compensate
for slow environmental changes. With such a method, an estimate of $\mathbf{u}$ is
updated iteratively through:

$$\mathbf{u}[n+1] = \frac{\mathbf{u}[n] - \mu\,e[n]\,\mathbf{x}[n]}{\left\| \mathbf{u}[n] - \mu\,e[n]\,\mathbf{x}[n] \right\|}, \qquad (8.44)$$

and

$$e[n] = \mathbf{u}^T[n]\,\mathbf{x}[n], \qquad (8.45)$$

with the constraint that $\|\mathbf{u}[n]\| = 1$, where $\mu$, the adaptation step, is a positive
constant.

In practice, (8.44) may not produce an accurate estimation of the impulse 
responses because of the nonstationarity of speech, the background noise, and 
the unknown length of the impulse responses. However, it yields a solution that 
is accurate enough for the purpose of TDE since in such an application only 
two direct paths are of interest. 
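
The constrained LMS recursion of (8.44) and (8.45) is short enough to sketch directly. The code below is a toy illustration, not the implementation of [15]: the function name `aed_tdoa`, the initialization (a single unit tap in the $\mathbf{h}_0$ part of $\mathbf{u}$), the step size, the three-tap synthetic channels, and the rule of reading each direct path as the largest-magnitude tap are all assumptions made for this example. The model length `M` is set equal to the true channel length, which a real room would not allow.

```python
import numpy as np

def aed_tdoa(x0, x1, M, mu=0.01):
    """Constrained-LMS estimate of u = [h1; -h0] and the TDOA read from its peaks."""
    u = np.zeros(2 * M)
    u[M] = 1.0                                    # assumed unit-norm initialization
    for n in range(M - 1, len(x0)):
        # Stacked regressor [x0[n], ..., x0[n-M+1], x1[n], ..., x1[n-M+1]].
        x = np.concatenate((x0[n - M + 1:n + 1][::-1], x1[n - M + 1:n + 1][::-1]))
        e = u @ x                                 # error signal, as in (8.45)
        u = u - mu * e * x                        # gradient step, numerator of (8.44)
        u = u / np.linalg.norm(u)                 # renormalize to keep ||u|| = 1
    h1_hat, h0_hat = u[:M], -u[M:]
    # Direct paths taken as the largest-magnitude taps (robust to the overall sign of u).
    tdoa = int(np.argmax(np.abs(h1_hat)) - np.argmax(np.abs(h0_hat)))
    return tdoa, u

# Toy reverberant channels with no common zeros; direct paths at taps 0 and 1.
rng = np.random.default_rng(0)
N = 30000
s = rng.standard_normal(N)
h0 = np.array([1.0, 0.0, 0.25])
h1 = np.array([0.0, 1.0, 0.5])
x0 = np.convolve(s, h0)[:N] + 0.05 * rng.standard_normal(N)
x1 = np.convolve(s, h1)[:N] + 0.05 * rng.standard_normal(N)
tdoa, u = aed_tdoa(x0, x1, M=3)
```

The estimate of $\mathbf{u}$ is only determined up to sign and scale, which is why the delay is read from tap positions rather than tap values.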



6. ADAPTIVE MULTICHANNEL TIME DELAY 
ESTIMATION 

In the adaptive eigenvalue decomposition (AED) algorithm, the delay es- 
timate is obtained by blindly identifying two channel impulse responses. It 
requires that the two channels do not share any common zeros, which is usually 
true for systems with short impulse responses. In many application scenarios 
such as room acoustic environments, however, the channel impulse response 
from the source to the microphone sensor could be very long. As a result, the 
likelihood for two impulse responses not sharing common zeros tends to be low 
and the AED algorithm often fails when a zero is shared between two channels 
or some zeros of the two channels are close. One way to overcome this problem 
is to employ more channels in the system, since it would be less likely for all 
channels to share a common zero when the number of sensors is large. This 
idea leads to an adaptive multichannel time delay estimation approach based 
on a blind channel identification technique [32]. 

6.1 PRINCIPLE 

Generalizing the approach of Section 5 to more than two channels, we have
in the noiseless case:

$$x_i[n] * h_j = s[n] * h_i * h_j = x_j[n] * h_i, \quad i, j = 0, 1, \ldots, L, \qquad (8.46)$$

and the vector-matrix form of the cross relation between the $i$th and $j$th channel
outputs is

$$\mathbf{x}_i^T[n]\,\mathbf{h}_j = \mathbf{x}_j^T[n]\,\mathbf{h}_i, \quad i, j = 0, 1, 2, \ldots, L, \; i \ne j. \qquad (8.47)$$

When noise is present or the channel impulse responses are improperly modeled,
the left and right hand sides of (8.47) are generally not equal and the inequality
can be used to define an error signal at time $n + 1$ as follows:

$$e_{ij}[n+1] = \frac{\mathbf{x}_i^T[n+1]\,\hat{\mathbf{h}}_j[n] - \mathbf{x}_j^T[n+1]\,\hat{\mathbf{h}}_i[n]}{\|\hat{\mathbf{h}}[n]\|}, \quad i, j = 0, 1, \ldots, L, \qquad (8.48)$$

where

$$\hat{\mathbf{h}}_i[n] = \left[\, \hat{h}_{i,0}[n] \;\; \hat{h}_{i,1}[n] \;\; \cdots \;\; \hat{h}_{i,M-1}[n] \,\right]^T$$

is the modeling filter for the $i$th channel at time $n$ and

$$\hat{\mathbf{h}}[n] = \left[\, \hat{\mathbf{h}}_0^T[n] \;\; \hat{\mathbf{h}}_1^T[n] \;\; \cdots \;\; \hat{\mathbf{h}}_L^T[n] \,\right]^T.$$

The modeling filter is normalized in order to avoid a trivial solution whose
elements are all zero. Based on the error signal defined here, a cost function at
time $n + 1$ is given by

$$J[n+1] = \sum_{i=0}^{L-1} \sum_{j=i+1}^{L} e_{ij}^2[n+1]. \qquad (8.49)$$

The TDE problem is then to obtain the estimate of $\mathbf{h}_i / \|\mathbf{h}\|$ ($i = 0, 1, \ldots, L$)
that minimizes this cost function.


6.2 TIME-DOMAIN MULTICHANNEL LMS 
APPROACH 

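A straightforward approach to estimating channel impulse responses from 
the cost function defined in (8.49) is through the multichannel LMS (MCLMS) 
algorithm [44], which updates $\hat{\mathbf{h}}$ through: 

$$ \hat{\mathbf{h}}[n+1] = \hat{\mathbf{h}}[n] - \mu \nabla J[n+1], \tag{8.50} $$

where $\mu$ is a small positive step size. As shown in [44], the gradient of J[n + 1] 
is determined as: 

$$ \nabla J[n+1] = \frac{\partial J[n+1]}{\partial \hat{\mathbf{h}}[n]} = \frac{2 \left\{ \mathbf{R}[n+1] \hat{\mathbf{h}}[n] - J[n+1] \hat{\mathbf{h}}[n] \right\}}{\|\hat{\mathbf{h}}[n]\|^2}, \tag{8.51} $$

where 

$$ \mathbf{R}[n+1] = \begin{bmatrix}
\sum_{i \neq 0} \mathbf{R}_{x_i x_i}[n+1] & -\mathbf{R}_{x_1 x_0}[n+1] & \cdots & -\mathbf{R}_{x_L x_0}[n+1] \\
-\mathbf{R}_{x_0 x_1}[n+1] & \sum_{i \neq 1} \mathbf{R}_{x_i x_i}[n+1] & \cdots & -\mathbf{R}_{x_L x_1}[n+1] \\
\vdots & \vdots & \ddots & \vdots \\
-\mathbf{R}_{x_0 x_L}[n+1] & -\mathbf{R}_{x_1 x_L}[n+1] & \cdots & \sum_{i \neq L} \mathbf{R}_{x_i x_i}[n+1]
\end{bmatrix} $$

and 

$$ \mathbf{R}_{x_i x_j}[n+1] = \mathbf{x}_i[n+1] \mathbf{x}_j^T[n+1], \quad i, j = 0, 1, \ldots, L. $$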

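If the modeling filter is normalized after each update, then a simplified algorithm 
is obtained: 

$$ \hat{\mathbf{h}}[n+1] = \frac{\hat{\mathbf{h}}[n] - 2\mu \left\{ \mathbf{R}[n+1] \hat{\mathbf{h}}[n] - J[n+1] \hat{\mathbf{h}}[n] \right\}}{\left\| \hat{\mathbf{h}}[n] - 2\mu \left\{ \mathbf{R}[n+1] \hat{\mathbf{h}}[n] - J[n+1] \hat{\mathbf{h}}[n] \right\} \right\|}. \tag{8.52} $$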

It was shown theoretically and demonstrated empirically in [44] that the 
MCLMS algorithm converges in mean to the desired impulse responses. 
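
The normalized update (8.52) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of [44]: the helper names (`dot`, `mclms_step`, `fir`), the toy two-channel system, the impulse-like initialization, and the step size are all hypothetical choices. The sketch uses the fact that, by the definition of R[n + 1], the k-th block of R[n + 1]h[n] equals the sum over i ≠ k of x_i[n + 1] times the unnormalized error x_i^T h_k - x_k^T h_i, so the matrix never has to be formed explicitly. The filter passed in is assumed to have unit norm, as (8.52) presumes.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def mclms_step(h_hat, x_vecs, mu):
    """One iteration of (8.52).
    h_hat:  list of L+1 model filters (M taps each, overall unit norm).
    x_vecs: list of L+1 regressors x_i[n+1] (M samples each)."""
    K, M = len(h_hat), len(h_hat[0])
    # eps[i][k] = x_i^T h_k - x_k^T h_i (cross-relation error, unit-norm filter)
    eps = [[dot(x_vecs[i], h_hat[k]) - dot(x_vecs[k], h_hat[i]) for k in range(K)]
           for i in range(K)]
    J = sum(eps[i][j] ** 2 for i in range(K - 1) for j in range(i + 1, K))
    new_h = []
    for k in range(K):
        # k-th block of R h: sum over i != k of x_i * eps[i][k]
        Rh_k = [sum(x_vecs[i][t] * eps[i][k] for i in range(K) if i != k)
                for t in range(M)]
        new_h.append([h_hat[k][t] - 2.0 * mu * (Rh_k[t] - J * h_hat[k][t])
                      for t in range(M)])
    norm = math.sqrt(sum(c * c for h in new_h for c in h))
    return [[c / norm for c in h] for h in new_h], J

def fir(s, h):
    """Causal FIR filtering with zero initial state (same length as s)."""
    return [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(s))]

random.seed(1)
M = 3
true_h = [[1.0, 0.4, 0.0], [0.0, 0.9, -0.2]]        # toy two-channel system
s = [random.gauss(0.0, 1.0) for _ in range(300)]
xs = [fir(s, h) for h in true_h]                     # noiseless sensor signals

# Assumed initialization: an impulse in every channel, overall unit norm.
h_hat = [[1.0 / math.sqrt(len(true_h))] + [0.0] * (M - 1) for _ in true_h]
for n in range(M, len(s)):
    x_vecs = [[x[n - k] for k in range(M)] for x in xs]
    h_hat, J = mclms_step(h_hat, x_vecs, mu=0.01)
```

In the noiseless case the true (normalized) channel vector is a fixed point of this update, because every cross-relation error is then zero.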

6.3 FREQUENCY-DOMAIN ADAPTIVE 
ALGORITHMS 

The time-domain MCLMS algorithm is simple to implement and it converges 
steadily, but its convergence rate is slow. There are many ways to accelerate 
convergence. One is to implement the adaptive algorithm in the frequency 
domain, which takes advantage of the fast Fourier transform (FFT) to make 
the adaptive process more efficient [32]. 

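To begin, we define an intermediate signal $y_{ij} = x_i * \hat{h}_j$, the convolution 
result of the ith channel output $x_i$ and the jth model filter $\hat{h}_j$. In vector form, 
a block of such a signal can be expressed in the frequency domain as 

$$ \underline{\mathbf{y}}_{ij}[m+1] = \mathcal{W}^{01}_{M \times 2M} \mathcal{D}_{x_i}[m+1] \mathcal{W}^{10}_{2M \times M} \underline{\hat{\mathbf{h}}}_j[m], \tag{8.53} $$

where underlined symbols denote frequency-domain quantities, 

$$ \mathcal{W}^{01}_{M \times 2M} = \mathbf{F}_{M \times M} \left[ \mathbf{0}_{M \times M} \;\; \mathbf{I}_{M \times M} \right] \mathbf{F}^{-1}_{2M \times 2M}, $$
$$ \mathcal{D}_{x_i}[m+1] = \mathrm{diag}\left\{ \mathbf{F}_{2M \times 2M} \, \mathbf{x}_i[m+1]_{2M \times 1} \right\}, $$
$$ \mathcal{W}^{10}_{2M \times M} = \mathbf{F}_{2M \times 2M} \left[ \mathbf{I}_{M \times M} \;\; \mathbf{0}_{M \times M} \right]^T \mathbf{F}^{-1}_{M \times M}, $$
$$ \underline{\hat{\mathbf{h}}}_j[m] = \mathbf{F}_{M \times M} \hat{\mathbf{h}}_j[m], $$
$$ \mathbf{x}_i[m+1]_{2M \times 1} = \left[ x_i[mM] \;\; x_i[mM+1] \;\; \cdots \;\; x_i[mM+2M-1] \right]^T, $$

$\mathbf{F}_{M \times M}$ and $\mathbf{F}^{-1}_{M \times M}$ are respectively the Fourier and inverse Fourier matrices 
of size M × M, and m is the block time index. Then a block of the error signal 
based on the cross relation between the ith and the jth channel in the frequency 
domain is determined as: 

$$ \underline{\mathbf{e}}_{ij}[m+1] = \underline{\mathbf{y}}_{ij}[m+1] - \underline{\mathbf{y}}_{ji}[m+1] = \mathcal{W}^{01}_{M \times 2M} \left\{ \mathcal{D}_{x_i}[m+1] \mathcal{W}^{10}_{2M \times M} \underline{\hat{\mathbf{h}}}_j[m] - \mathcal{D}_{x_j}[m+1] \mathcal{W}^{10}_{2M \times M} \underline{\hat{\mathbf{h}}}_i[m] \right\}. \tag{8.54} $$

Continuing, we can construct a (frequency-domain) cost function at the (m + 1)th 
time block index as follows: 

$$ J_f[m+1] = \sum_{i=0}^{L-1} \sum_{j=i+1}^{L} \underline{\mathbf{e}}_{ij}^H[m+1] \, \underline{\mathbf{e}}_{ij}[m+1]. \tag{8.55} $$

Therefore, by minimizing $J_f[m+1]$, the modeling filter can be updated in the 
frequency domain as: 

$$ \underline{\hat{\mathbf{h}}}_k[m+1] = \underline{\hat{\mathbf{h}}}_k[m] - \mu_f \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]}, \quad k = 0, 1, \ldots, L, \tag{8.56} $$

where $\mu_f$ is a small positive step size. It can be shown that [23] 

$$ \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]} = \sum_{i=0}^{L} \left\{ \mathcal{W}^{01}_{M \times 2M} \mathcal{D}_{x_i}[m+1] \mathcal{W}^{10}_{2M \times M} \right\}^H \underline{\mathbf{e}}_{ik}[m+1]. \tag{8.57} $$

Substituting (8.57) into (8.56) yields the multichannel frequency-domain LMS 
(MCFLMS) algorithm: 

$$ \underline{\hat{\mathbf{h}}}_k[m+1] = \underline{\hat{\mathbf{h}}}_k[m] - \mu_f \mathcal{W}^{10}_{M \times 2M} \sum_{i=0}^{L} \mathcal{D}^*_{x_i}[m+1] \mathcal{W}^{01}_{2M \times M} \underline{\mathbf{e}}_{ik}[m+1], \tag{8.58} $$

where 

$$ \mathcal{W}^{10}_{M \times 2M} = \mathbf{F}_{M \times M} \left[ \mathbf{I}_{M \times M} \;\; \mathbf{0}_{M \times M} \right] \mathbf{F}^{-1}_{2M \times 2M}, $$
$$ \mathcal{W}^{01}_{2M \times M} = \mathbf{F}_{2M \times 2M} \left[ \mathbf{0}_{M \times M} \;\; \mathbf{I}_{M \times M} \right]^T \mathbf{F}^{-1}_{M \times M}. $$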



A constraint ensuring that the adaptive algorithm would not converge to a trivial 
solution with all zero elements can be applied in either the frequency or the 
time domain. Since in the application of time delay estimation the delay will be 
estimated in the time domain and the time-domain modeling filter coefficients 
have to be computed anyway, we will enforce the constraint in the time domain 
for convenience. 
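
The matrix product in (8.53) is an overlap-save computation: the model filter is zero-padded to 2M taps and transformed, multiplied bin by bin with the transform of a 2M-sample input block, transformed back, and only the last M output samples are kept. The short Python sketch below checks this against direct convolution. It is an illustration only: the naive O(N²) `dft` helper stands in for an FFT, the function name `block_output` and the test data are hypothetical, and the function returns the time-domain block, i.e. the result before the final transform by F of size M × M.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(N^2) DFT; the inverse transform carries the 1/N factor."""
    N = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
           for k in range(N)]
    return [c / N for c in out] if inverse else out

def block_output(x_block, h):
    """Last M samples of the circular convolution of a 2M-sample block with h,
    the time-domain counterpart of W01 * D_x * W10 * h in (8.53)."""
    M = len(h)
    assert len(x_block) == 2 * M
    X = dft(x_block)                       # diagonal of D_x
    H = dft(list(h) + [0.0] * M)           # W10 h: zero-pad to 2M, then transform
    y = dft([a * b for a, b in zip(X, H)], inverse=True)
    return [c.real for c in y[M:]]         # the [0 I] selection keeps the valid half

M = 4
x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1, 0.0, -0.9, 0.4, 1.6, -2.1, 0.8]
h = [1.0, -0.5, 0.25, 0.1]
m = 1                                      # block time index
fast = block_output(x[m * M: m * M + 2 * M], h)
# direct linear convolution at the same time indices mM+M, ..., mM+2M-1
direct = [sum(h[k] * x[n - k] for k in range(M))
          for n in range(m * M + M, m * M + 2 * M)]
```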

The MCFLMS is computationally more efficient than a multichannel time- 
domain block LMS algorithm. However, the MCFLMS and its time-domain 
counterpart are equivalent in performance. The convergence of the MCFLMS 
algorithm is still slow because of nonuniform convergence rates of the filter 
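coefficients and cross-coupling between them. To accelerate convergence, we 
will use Newton’s method to develop a normalized MCFLMS (NMCFLMS) 
method. 

By using Newton’s method, the coefficients of the model filter can be updated 
according to: 

$$ \underline{\hat{\mathbf{h}}}_k[m+1] = \underline{\hat{\mathbf{h}}}_k[m] - \mu_f \, E\left\{ \frac{\partial}{\partial \underline{\hat{\mathbf{h}}}_k[m]} \left[ \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]} \right]^T \right\}^{-1} \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]}, \tag{8.59} $$

where the Hessian matrix can be evaluated as 

$$ E\left\{ \frac{\partial}{\partial \underline{\hat{\mathbf{h}}}_k[m]} \left[ \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]} \right]^T \right\} = \mathcal{W}^{10}_{M \times 2M} \left[ \sum_{i=0, i \neq k}^{L} E\left\{ \mathcal{D}^*_{x_i}[m+1] \mathcal{W}^{01}_{2M \times 2M} \mathcal{D}_{x_i}[m+1] \right\} \right] \mathcal{W}^{10}_{2M \times M}, \tag{8.60} $$

and 

$$ \mathcal{W}^{01}_{2M \times 2M} \triangleq \mathcal{W}^{01}_{2M \times M} \mathcal{W}^{01}_{M \times 2M} = \mathbf{F}_{2M \times 2M} \begin{bmatrix} \mathbf{0}_{M \times M} & \mathbf{0}_{M \times M} \\ \mathbf{0}_{M \times M} & \mathbf{I}_{M \times M} \end{bmatrix} \mathbf{F}^{-1}_{2M \times 2M}. $$

As shown in [23], when M is large, $2\mathcal{W}^{01}_{2M \times 2M}$ can be well approximated by 
the identity matrix 

$$ 2\mathcal{W}^{01}_{2M \times 2M} \approx \mathbf{I}_{2M \times 2M}. \tag{8.61} $$

Thus, (8.60) becomes 

$$ E\left\{ \frac{\partial}{\partial \underline{\hat{\mathbf{h}}}_k[m]} \left[ \frac{\partial J_f[m+1]}{\partial \underline{\hat{\mathbf{h}}}_k^*[m]} \right]^T \right\} \approx \frac{1}{2} \mathcal{W}^{10}_{M \times 2M} \mathcal{P}_k[m+1] \mathcal{W}^{10}_{2M \times M}, \tag{8.62} $$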

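where 

$$ \mathcal{P}_k[m+1] = \sum_{i=0, i \neq k}^{L} E\left\{ \mathcal{D}^*_{x_i}[m+1] \mathcal{D}_{x_i}[m+1] \right\}, \quad k = 0, 1, \ldots, L. $$

Substituting (8.57) and (8.62) into (8.59) and multiplying by $\mathcal{W}^{10}_{2M \times M}$, we obtain 
the constrained NMCFLMS algorithm: 

$$ \begin{aligned}
\underline{\hat{\mathbf{h}}}_k^{10}[m+1] &= \underline{\hat{\mathbf{h}}}_k^{10}[m] - 2\mu_f \mathcal{W}^{10}_{2M \times M} \left\{ \mathcal{W}^{10}_{M \times 2M} \mathcal{P}_k[m+1] \mathcal{W}^{10}_{2M \times M} \right\}^{-1} \mathcal{W}^{10}_{M \times 2M} \sum_{i=0}^{L} \mathcal{D}^*_{x_i}[m+1] \, \underline{\mathbf{e}}_{ik}^{01}[m+1] \\
&= \underline{\hat{\mathbf{h}}}_k^{10}[m] - 2\mu_f \mathcal{W}^{10}_{2M \times 2M} \mathcal{P}_k^{-1}[m+1] \sum_{i=0}^{L} \mathcal{D}^*_{x_i}[m+1] \, \underline{\mathbf{e}}_{ik}^{01}[m+1],
\end{aligned} \tag{8.63} $$

where 

$$ \underline{\hat{\mathbf{h}}}_k^{10}[m] = \mathcal{W}^{10}_{2M \times M} \underline{\hat{\mathbf{h}}}_k[m] = \mathbf{F}_{2M \times 2M} \left[ \hat{\mathbf{h}}_k^T[m] \;\; \mathbf{0}_{1 \times M} \right]^T, $$
$$ \underline{\mathbf{e}}_{ik}^{01}[m+1] = \mathcal{W}^{01}_{2M \times M} \underline{\mathbf{e}}_{ik}[m+1] = \mathbf{F}_{2M \times 2M} \left[ \mathbf{0}_{1 \times M} \;\; \mathbf{e}_{ik}^T[m+1] \right]^T, $$
$$ \mathcal{W}^{10}_{2M \times 2M} = \mathcal{W}^{10}_{2M \times M} \mathcal{W}^{10}_{M \times 2M} = \mathbf{F}_{2M \times 2M} \begin{bmatrix} \mathbf{I}_{M \times M} & \mathbf{0}_{M \times M} \\ \mathbf{0}_{M \times M} & \mathbf{0}_{M \times M} \end{bmatrix} \mathbf{F}^{-1}_{2M \times 2M}, $$

and the relation 

$$ \mathcal{W}^{10}_{2M \times M} \left\{ \mathcal{W}^{10}_{M \times 2M} \mathcal{P}_k[m+1] \mathcal{W}^{10}_{2M \times M} \right\}^{-1} \mathcal{W}^{10}_{M \times 2M} = \mathcal{W}^{10}_{2M \times 2M} \mathcal{P}_k^{-1}[m+1] $$

can be justified by post-multiplying both sides of the expression by 
$\mathcal{P}_k[m+1] \mathcal{W}^{10}_{2M \times M}$ and recognizing that $\mathcal{W}^{10}_{2M \times 2M} \mathcal{W}^{10}_{2M \times M} = \mathcal{W}^{10}_{2M \times M}$. 
If the matrix $2\mathcal{W}^{10}_{2M \times 2M}$ is approximated by the identity matrix similar to 
(8.61) for $\mathcal{W}^{01}_{2M \times 2M}$, we finally deduce the unconstrained NMCFLMS algorithm: 

$$ \underline{\hat{\mathbf{h}}}_k^{10}[m+1] = \underline{\hat{\mathbf{h}}}_k^{10}[m] - \mu_f \mathcal{P}_k^{-1}[m+1] \sum_{i=0}^{L} \mathcal{D}^*_{x_i}[m+1] \, \underline{\mathbf{e}}_{ik}^{01}[m+1], \tag{8.64} $$

where the normalization matrix $\mathcal{P}_k[m+1]$ is diagonal and it is easy to find its 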
inverse. Again, the unit-norm constraint will be enforced on the modeling filter 
coefficients in the time domain. 
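
To make the structure of the update (8.64) concrete, a rough Python sketch of one block iteration follows. It is not the implementation of [32] and makes several assumptions that should be read as such: the normalization matrix is replaced by its instantaneous value (no averaging over blocks) plus a small regularization `delta`; the updated filter is truncated back to M time-domain taps after every block before the unit-norm constraint is applied; a naive `dft` stands in for the FFT; and the delay is read off as the difference between the dominant taps of two identified responses (`peak_delay`). All function names and the toy channels are hypothetical.

```python
import cmath
import math
import random

def dft(v, inverse=False):
    """Naive O(N^2) DFT; the inverse transform carries the 1/N factor."""
    N = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
           for k in range(N)]
    return [c / N for c in out] if inverse else out

def nmcflms_block(h_hat, x_blocks, mu_f, delta=1e-8):
    """One block update patterned on (8.64).
    h_hat:    K time-domain filters of M taps (overall unit norm).
    x_blocks: K blocks of 2M samples x_i[mM], ..., x_i[mM+2M-1]."""
    K, M = len(h_hat), len(h_hat[0])
    X = [dft(xb) for xb in x_blocks]                     # diagonals of D_{x_i}
    H10 = [dft(list(h) + [0.0] * M) for h in h_hat]      # F [h_k; 0]
    new_h = []
    for k in range(K):
        # instantaneous, regularized diagonal of P_k (assumption, see lead-in)
        P_k = [sum(abs(X[i][f]) ** 2 for i in range(K) if i != k) + delta
               for f in range(2 * M)]
        acc = [0j] * (2 * M)
        for i in range(K):
            if i == k:
                continue
            # e_ik in time: last M samples of (x_i * h_k - x_k * h_i)
            diff = dft([X[i][f] * H10[k][f] - X[k][f] * H10[i][f]
                        for f in range(2 * M)], inverse=True)
            e01 = dft([0.0] * M + diff[M:])              # F [0; e_ik]
            for f in range(2 * M):
                acc[f] += X[i][f].conjugate() * e01[f]
        H_new = [H10[k][f] - mu_f * acc[f] / P_k[f] for f in range(2 * M)]
        new_h.append([c.real for c in dft(H_new, inverse=True)[:M]])
    norm = math.sqrt(sum(c * c for h in new_h for c in h))  # unit-norm constraint
    return [[c / norm for c in h] for h in new_h]

def peak_delay(h_a, h_b):
    """Relative delay (in samples) between the dominant taps of two responses."""
    ia = max(range(len(h_a)), key=lambda t: abs(h_a[t]))
    ib = max(range(len(h_b)), key=lambda t: abs(h_b[t]))
    return ib - ia

def fir(s, h):
    """Causal FIR filtering with zero initial state (same length as s)."""
    return [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(s))]

random.seed(2)
M = 4
true_h = [[0.0, 1.0, 0.3, 0.0], [0.0, 0.0, 0.0, 0.8], [0.9, 0.0, -0.2, 0.1]]
s = [random.gauss(0.0, 1.0) for _ in range(40 * M)]
xs = [fir(s, h) for h in true_h]                         # noiseless sensor signals

h_hat = [[1.0 / math.sqrt(len(true_h))] + [0.0] * (M - 1) for _ in true_h]
for m in range(len(s) // M - 1):
    blocks = [x[m * M: m * M + 2 * M] for x in xs]
    h_hat = nmcflms_block(h_hat, blocks, mu_f=0.5)
delay_01 = peak_delay(h_hat[0], h_hat[1])                # Sensor 0 versus Sensor 1
```

Because the normalization matrix is diagonal, its inverse reduces to a per-bin division, which is what keeps the normalized update inexpensive.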

Detailed implementation of the aforementioned adaptive multichannel 
(AMC) algorithms can be found in [32]. 

Figure 8.3 Varechoic chamber floor plan (coordinate values measured in meters); the loudspeaker 
source is located at (0.337, 2.162, 1.600); six microphones are placed at (2.437, 5.600, 
1.400), (2.537, 5.600, 1.400), (2.637, 5.600, 1.400), (2.737, 5.600, 1.400), (2.837, 5.600, 1.400), 
(2.937, 5.600, 1.400), respectively. 

7. EXPERIMENTS 

7.1 EXPERIMENTAL SETUP 

The measurements used in this chapter were made in the Varechoic chamber 
at Bell Labs [45]. A diagram of the floor plan layout is shown in Fig. 8.3. For 
convenience, positions in the floor plan will be designated by (x, y) coordinates 
with reference to the southwest corner and corresponding to meters along the 
(South, West) walls. The chamber is of size 6.7 m × 6.1 m × 2.9 m (x × y × z) 
with 368 electronically controlled panels that vary the acoustic absorption of 
the walls, floor, and ceiling [46]. Each panel consists of two perforated sheets 
whose holes, if aligned, expose sound absorbing material behind, but if shifted 
to misalign, form a highly reflective surface. The panels are individually con- 
trolled so that the holes on one particular panel are either fully open (absorbing 
state) or fully closed (reflective state). Therefore, by varying the binary state of 
each panel in any combination, $2^{368}$ different room characteristics can be simulated. 
A linear microphone array with six omni-directional microphones was 
employed in the measurement and the spacing between adjacent microphones 
is 10 cm. The array was mounted 1.4 m above the floor and parallel to the 
North wall at a distance of 50 cm. The six microphone positions are denoted as 
M1 (2.437, 5.600, 1.400), M2 (2.537, 5.600, 1.400), M3 (2.637, 5.600, 1.400), 
M4 (2.737, 5.600, 1.400), M5 (2.837, 5.600, 1.400), and M6 (2.937, 5.600, 
1.400), respectively. The source was simulated by placing a loudspeaker at 
(0.337, 2.162, 1.600). The transfer functions of the acoustic channels between 
the loudspeaker and six microphones were measured at a 48 kHz sampling 
rate. Then the obtained channel impulse responses were downsampled to a 16 
kHz sampling rate and truncated to 4096 samples. These measured impulse responses 
will be treated as the actual impulse responses in the TDE experiments. 

Figure 8.4 Speech signal used as the source, sampled at 16 kHz. 

The source signal is a sequence of clean speech (from a female speaker) 
sampled at 16 kHz and of duration 2 minutes. The signal waveform is shown 
in Fig. 8.4. The multichannel system output is computed by convolving the 
speech source with the corresponding measured channel impulse responses 
and adding zero-mean, white, Gaussian noise to each one of these outputs for 
a given signal-to-noise ratio (SNR). 
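
The signal generation just described can be sketched as follows. This is a minimal illustration under assumptions, not the procedure actually used for the experiments: a white Gaussian sequence stands in for the speech signal, two short filters stand in for the measured 4096-tap impulse responses, and the noise is scaled so that the realized SNR of each finite record equals the target exactly (whether the SNR was fixed per record or only in expectation is not stated in the text). The function names are hypothetical.

```python
import math
import random

def fir(s, h):
    """Causal FIR filtering with zero initial state (same length as s)."""
    return [sum(h[k] * s[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(s))]

def add_noise_at_snr(x, snr_db, rng):
    """Add white Gaussian noise, scaled so the realized SNR equals snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    p_sig = sum(v * v for v in x) / len(x)
    p_noise = sum(v * v for v in noise) / len(noise)
    gain = math.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [v + gain * w for v, w in zip(x, noise)]

def realized_snr_db(clean, noisy):
    """SNR in dB measured from a clean record and its noisy version."""
    p_sig = sum(v * v for v in clean)
    p_err = sum((n - v) ** 2 for v, n in zip(clean, noisy))
    return 10.0 * math.log10(p_sig / p_err)

rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(2000)]   # stand-in for the speech signal
channels = [[1.0, 0.0, 0.5], [0.0, 0.7, 0.2]]         # stand-ins for measured responses
clean = [fir(source, h) for h in channels]            # multichannel system output
noisy = [add_noise_at_snr(x, -5.0, rng) for x in clean]
```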

7.2 PERFORMANCE MEASURE 

To better evaluate the performance of a time delay estimator, it would be 
helpful to classify an estimate as either “success” or “failure” [18, 29]. An 
estimate $\hat{\tau}_i$ for which the absolute error $|\hat{\tau}_i - \tau_i|$ exceeds $T_c/2$, where $T_c$ is 
the signal correlation time and $\tau_i$ is the true delay, is identified as a failure (or 
anomaly), which follows the terminology used in [29]. Otherwise, an estimate 
would be deemed as a success (or nonanomaly). In this chapter, $T_c$ is defined as 
the width of the main lobe of the source signal autocorrelation function (taken 
between the -3 dB points). For the particular source signal used here, which 
is sampled at 16 kHz, $T_c$ is equal to 4.7 samples. 

After time delay estimates are classified into the two classes, the TDE per- 
formance is evaluated in terms of the percentage of anomalies over the total 
estimates, and the MSE of the nonanomalous estimates. 
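
The two performance figures can be computed as in the short Python sketch below. The function name `tde_performance` and the sample estimates are hypothetical; the threshold follows the definition above, so an estimate is an anomaly when its absolute error exceeds half the correlation time.

```python
def tde_performance(estimates, true_delay, t_c):
    """Return (percentage of anomalies, MSE of the nonanomalous estimates).
    Delays and t_c are expressed in the same unit (samples here)."""
    errors = [est - true_delay for est in estimates]
    good = [e for e in errors if abs(e) <= t_c / 2.0]      # successes
    pct_anomalies = 100.0 * (len(errors) - len(good)) / len(errors)
    mse = sum(e * e for e in good) / len(good) if good else float("nan")
    return pct_anomalies, mse

# Five made-up delay estimates around a true delay of 2 samples, T_c = 4.7 samples.
pct, mse = tde_performance([2.0, 3.0, 2.5, 9.0, -4.0], true_delay=2.0, t_c=4.7)
```

In this made-up example the last two estimates miss the true delay by 7 and 6 samples, which exceeds 2.35 samples, so they count as anomalies and are excluded from the MSE.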

The GCC (Section 3) and AED (Section 5) algorithms, using signals received 
by two sensors, estimate one delay, which is the TDOA between the two sensors 
(we assume they are Sensor 0 and 1). The multichannel cross-correlation (MCC) 
algorithm (Sections 4.2 and 4.3), although exploiting multiple sensors, also generates 
one time delay estimate (without loss of generality, we assume that it is the 
TDOA between Sensor 0 and 1). For the adaptive multichannel (AMC) TDE 
algorithm [the unconstrained NMCFLMS algorithm (Section 6.3)] where more 
than two channels are available, a time delay estimate for each sensor pair will 
be obtained. For the purpose of a fair comparison among algorithms, we only 
evaluate the time delay between Sensor 0 and 1, and delay estimates between 
other sensor pairs will be neglected. 

Figure 8.5 The TDE performances of the PHAT, AED, MCC, and AMC algorithms; $T_{60}$ = 240 
ms, SNR = 10 dB. 

7.3 EXPERIMENTAL RESULTS 

For brevity, we cite four sets of experimental results. The first one involves a 
set of data obtained in a lightly reverberant and less noisy environment where the 
reverberation time, $T_{60}$ (defined as the time for the sound to die away to a level 
60 decibels below its original level and measured by Schroeder’s method [47]), 
is approximately 240 ms, and SNR = 10 dB. The TDE results are presented 
in Fig. 8.5. As seen, all four algorithms can accurately determine the 
relative TDOA with no anomalies being observed in this situation. For the 
two-microphone case, the four algorithms yield similar MSE. When more than two 
microphones are employed, the MSE of both the MCC and AMC algorithms 
decreases, showing the advantage of using multiple microphones. (In this and 
following figures, the fitting curve is a second order polynomial.) 

The second experiment pertains to a set of data acquired in a condition where 
the reverberation is the same as in the previous setup, but this time the noise 
is much stronger with an SNR = -5 dB. The result is shown in 
Fig. 8.6. All four algorithms deteriorate in their performance. 
Since the SNR is very low in this case, noise is the dominant distortion source 
that causes the performance degradation. From Fig. 8.6 (a), one can see that 
in the two-sensor case, the cross correlation based algorithms perform slightly 
better than the blind channel identification based approaches, suggesting that 
the cross correlation based methods are more tolerant to strong noise. Again, 
the TDE performance of both the MCC and AMC algorithms improves with 
the number of sensors. 

Figure 8.6 The TDE performance of the PHAT, AED, MCC, and AMC algorithms; $T_{60}$ = 240 
ms, SNR = -5 dB. 

When reverberation becomes heavier, each microphone sensor receives a 
greater number of delayed and attenuated replicas of the source signal due to 
reflections from room boundaries, which makes the TDE problem harder. The 
third experiment considers a more reverberant but less noisy environment 
where $T_{60}$ = 580 ms and SNR = 10 dB. The TDE result is shown in Fig. 8.7. 
Comparing Fig. 8.7 with Fig. 8.5, one can see that all the studied algorithms 
suffer performance degradation when reverberation increases. It is noticed from 
Fig. 8.7 (b) that in the two-microphone case, the blind channel identification 
based techniques have lower MSE compared to the cross correlation based 
algorithms, indicating that the former techniques are more robust with respect 
to strong reverberation. 

The final experiment shows the TDE performance behavior in a heavily 
reverberant and strongly noisy environment. As seen from Fig. 8.8, all four 
algorithms suffer dramatic degradation in their performance. However, as the 
number of sensors increases, the performance of both the MCC and AMC 
algorithms improves. It is noted that both the MCC and AMC algorithms can 
greatly benefit from the use of multiple sensors. Among these two, the MCC 
algorithm improves faster in its performance with the number of microphones 
than does the AMC method. This is because the former exploits multiple 
microphones to estimate only one delay, which fully utilizes the redundancy 
provided by the array, while the latter technique estimates the TDOAs between 
each microphone pair. Since the MCC algorithm takes advantage of the array 
geometry to improve its performance, the array has to be well designed and 
calibrated. For the AMC algorithm, however, the TDOA between each sensor 
pair is estimated based on identifying their impulse responses, so the array 
geometry is not of great concern. 

Figure 8.7 The TDE performances of the PHAT, AED, MCC, and AMC algorithms; $T_{60}$ = 580 
ms, SNR = 10 dB. 

Figure 8.8 The TDE performance of the PHAT, AED, MCC, and AMC algorithms; $T_{60}$ = 580 
ms, SNR = -5 dB. 

8. CONCLUSIONS 

Time delay estimation (TDE) in a reverberant acoustic environment still 
remains a very challenging and difficult problem. There are mainly two ap- 
proaches to deal more efficiently with reverberation. The first is to use more 
than two sensors and take advantage of the redundancy. The second is the inclu- 
sion of the acoustic channel impulse responses into the TDE algorithms. This 
chapter reviewed some recent efforts in developing TDE techniques that are 
robust against reverberation. Addressed were the generalized cross-correlation 
(GCC) method, the multichannel cross-correlation (MCC) technique, the adaptive 
eigenvalue decomposition (AED) algorithm, and the adaptive multichannel 
(AMC) algorithms. 

The GCC method is a two-sensor technique based upon the ideal single- 
path acoustic propagation model. It performs fairly well in moderate noise and 
reverberation conditions when the two prefilters are properly selected. However, 
it suffers severe performance degradation in the presence of strong noise and 
heavy reverberation. The MCC algorithm is a natural generalization of the 
classical cross-correlation method to the multichannel case. It takes advantage 
of the redundancy provided by multiple sensors to estimate one time delay. 
The TDE performance of the MCC algorithm in the presence of noise and 
reverberation improves with the number of microphone sensors. The AED 
algorithm is also a two-sensor technique. Different from the GCC method, it 
is based on the reverberant propagation model and obtains the delay estimates 
based on blindly identifying the two impulse responses. The multichannel 
adaptive algorithm is an extension of the AED approach, which improves time 
delay estimation by exploiting the diversity among multiple channels. 

Experiments were performed using data measured in the Varechoic chamber 
at Bell Labs. It was shown that in the two-microphone case, the correlation 
based techniques are more robust with respect to noise, while more sensitive 
to reverberation than the blind channel identification based algorithms. When 
multiple microphone sensors are available, both the MCC and AMC algorithms 
improve the TDE performances with the number of microphones, with the 
former coping better than the latter against reverberation and noise. However, 
to make the MCC algorithm efficient, the microphone array has to be well 
designed and calibrated, which is not of big concern for the AMC method since 
it estimates TDOA between each microphone pair by identifying the channel 
impulse responses. 

Abstract Time delays of arrival are important parametric representations of acoustic signals 
captured by a passive microphone array. But they are rarely directly used in an 
array signal processing system that usually assumes the explicit knowledge of 
a sound source location. In this chapter we turn our attention to passive source 
localization techniques that extract the spatial information of a sound source from 
time delays of arrival estimated using approaches investigated in the previous 
chapter. Parametric estimation is in general difficult when the signal model 
is nonlinear. Such is the case in the problem of source localization. We will 
review a number of advanced approaches with some examples and comment 
on their technical merits and shortcomings for practical implementations. A 
successful real-time acoustic source localization system for video camera steering 
in teleconferencing is presented at the end of this chapter. 

Keywords: Source Localization, Estimation Theory, Least Squares, Lagrange Multiplier, 

1. INTRODUCTION 

As precise context understanding helps a reader to comprehend the correct 
connotation that a message carries, meticulous spatial perception makes it a lot 
easier for a listener to grasp the gist and even the implication of a conversation 
in which he or she is involved, particularly when there are multiple participants. 
While the former has been well documented and accepted, the latter is not given 
adequate attention mainly because we make no conscious effort to localize a 
sound source. However, as a matter of fact, the knowledge of environmental 
sound sources and the capability of tracking them are essential to natural conver- 
sations and collaborations that are pursued in the next-generation multimedia 
telecommunication systems. 

Locating radiative point sources using passive, stationary sensor arrays is of 
considerable interest and has been a repeated theme of research in radar [1], [2], 
underwater sonar [3], and seismology [4]. A common method is to have the es- 
timate of source location based on time delay of arrival (TDOA) measurements 
between distinct sensor pairs. Nowadays, the same kind of technique is used 
to localize and track acoustic sources for emerging applications such as auto- 
matic camera tracking for video-conferencing [5], [6], [7], [8] and beamformer 
steering for suppressing noise and reverberation [9], [10], [11], [12] in all types 
of communication and voice processing systems. We believe that such a time 
delay estimation (TDE) based method will continue playing an important role 
in tomorrow’s multimedia communication systems. 

In order to estimate the location of a single sound source using estimated 
TDOAs, one needs to first choose a data model which describes how a source 
location is related to TDOA observations and how noise or measurement error is 
introduced. If errors (that are possibly mutually dependent) are supposed to be 
additive to and independent of the TDOA measurements, the source would be 
located at the intersection of a set of hyperboloids. Finding this intersection is a 
nonlinear problem. Although such an additive model does not easily lend itself 
to modification due to the nonlinearity, it describes the principal constraints 
imposed by the TDOA data in a simple way and thus is widely used in studying 
the source localization problem. 

There is a rich literature of source localization techniques that use the addi- 
tive measurement error model. Important distinctions between these methods 
include likelihood-based versus least-squares and linear approximation versus 
direct numerical optimization, as well as iterative versus closed-form algo- 
rithms. 

In early research of source localization with passive sensor arrays, the maximum 
likelihood (ML) principle was widely utilized [13], [14], [15], [16] because 
of the proven asymptotic consistency and efficiency of an ML estimator 
(MLE). However, the number of microphones in an array for camera pointing 
or beamformer steering in multimedia communication systems is always 
limited, which makes acoustic source localization a finite-sample rather than 
a large-sample problem. Moreover, ML estimators require additional assumptions 
about the distributions of the measurement errors. One approach is to 
invoke the central limit theorem and assume a Gaussian approximation, which 
makes the likelihood function easy to formulate. Although a Gaussian error 
was justified by Hahn and Tretter [13] for continuous-time processing, it can be 
difficult to verify and the MLE is no longer optimal when sampling introduces 
additional errors in discrete-time processing. To compute the solution to the 
MLE, a linear approximation and iterative numerical techniques have to be used 
because of the nonlinearity of the hyperbolic equations. The Newton-Raphson 
iterative method [17], the Gauss-Newton method [18], and the least-mean-square 
(LMS) algorithm are among possible choices. But for these iterative 
approaches, selecting a good initial guess to avoid a local minimum is difficult 
and convergence to the optimal solution cannot be guaranteed. Therefore, 
it is our opinion that an ML-based estimator is not suitable for the real-time 
implementation of a source localization system. 

For real-time applications, closed-form estimators are desired and, appropriately, 
have also gained wider attention. Of the closed-form estimators, triangulation 
is the most straightforward [6]. However, with triangulation it is difficult 
to take advantage of extra sensors and the TDOA redundancy. Nowadays most 
closed-form algorithms exploit a least-squares principle, which makes no additional 
assumption about the distribution of measurement errors. To construct a 
least-squares estimator, one needs to define an error function based on the measured 
TDOAs. Different error functions will result in different estimators with 
different complexity and performance. Schmidt [19] showed that the TDOAs 
to three sensors whose positions are known provide a straight line of possible 
source locations in two dimensions and a plane in three dimensions. By 
intersecting the lines/planes specified by different sensor triplets, he obtained 
an estimator called plane intersection. Another closed-form estimator, termed 
spherical intersection (SX), employed a spherical LS criterion [20]. The SX 
algorithm is mathematically simple, but requires an a priori solution for the 
source range, which may not exist or may not be unique in the presence of measurement 
errors. Based on the same criterion, Smith and Abel [21] proposed 
the spherical interpolation (SI) method, which also solved for the source range, 
again in the LS sense. Although the SI method has less bias, it is not efficient 
and it has a large standard deviation relative to the Cramer-Rao lower bound 
(CRLB). With the SI estimator, the source range is a byproduct that is assumed 
to be independent of the location coordinates. Chan and Ho [22] improved the 
SI estimation with a second LS estimator that accommodates the information 
redundancy from the SI estimates and updates the squares of the coordinates. 
We shall refer to this method as the quadratic-correction least-squares (QCLS) 
approach. In the QCLS estimator, the covariance matrix of measurement errors 
is used. But this information can be difficult to properly assume or accurately 
estimate, which results in a performance degradation in practice. When the SI 
estimate is analyzed and the quadratic correction is derived in the QCLS estimation 
procedure, perturbation approaches are employed and, presumptively, 
the magnitude of measurement errors has to be small. It has been indicated in 
[23] that the QCLS estimator yields an unbiased solution with a small standard 
deviation that is close to the CRLB at a moderate noise level. But when noise 
is practically strong, its bias is considerable and its variance could no longer 
approach the CRLB according to our Monte-Carlo simulations. Recently a 
linear-correction least-squares (LCLS) algorithm has been proposed by the authors 
in [24]. This method applies the additive measurement error model and 
employs the technique of Lagrange multipliers. It makes no assumption on the 
covariance matrix of measurement errors and utilizes no linear approximation 
that holds only in the case of small perturbation. 

In the following, we will develop a comprehensive framework for investigat- 
ing the problem of source localization and will comparatively study a number 
of approaches. We will evaluate all these algorithms with respect to estimation 
accuracy and efficiency, computational complexity, implementation flexibility, 
and adaptation capabilities to different and varying environments. 

2. SOURCE LOCALIZATION PROBLEM 

The problem addressed here is the determination of the location of an acoustic 
source given the array geometry and the relative TDOA measurements among 
different microphone pairs. The problem can be stated mathematically as fol- 
lows. 

The array consists of N + 1 microphones located at positions 

$$ \mathbf{r}_i = \left[ x_i \;\; y_i \;\; z_i \right]^T, \quad i = 0, \ldots, N \tag{9.1} $$

in Cartesian coordinates (see Fig. 9.1), where $(\cdot)^T$ denotes transpose of a vector 
or a matrix. The first microphone (i = 0) is regarded as the reference and is 
placed at the origin of the coordinate system, i.e. $\mathbf{r}_0 = [0, 0, 0]^T$. The acoustic 
source is located at $\mathbf{r}_s = [x_s, y_s, z_s]^T$. The distances from the origin to the i-th 
microphone and the source are denoted by $R_i$ and $R_s$, respectively, where 

$$ R_i = \|\mathbf{r}_i\| = \sqrt{x_i^2 + y_i^2 + z_i^2}, \tag{9.2} $$

$$ R_s = \|\mathbf{r}_s\| = \sqrt{x_s^2 + y_s^2 + z_s^2}. \tag{9.3} $$

The distance between the source and the i-th microphone is denoted by 

$$ D_i = \|\mathbf{r}_i - \mathbf{r}_s\| = \sqrt{(x_i - x_s)^2 + (y_i - y_s)^2 + (z_i - z_s)^2}. \tag{9.4} $$

Figure 9.1 Spatial diagram illustrating variables defined in the source localization problem. 

The difference in the distances of microphones i and j from the source is given 
by 

$$ d_{ij} = D_i - D_j, \quad i, j = 0, \ldots, N. \tag{9.5} $$

This difference is usually termed the range difference. It is proportional to the 
time delay of arrival $\tau_{ij}$. If the speed of sound is c, then 

$$ d_{ij} = c \cdot \tau_{ij}. \tag{9.6} $$

The speed of sound (in m/s) can be estimated from the air temperature $t_{air}$ (in 
degrees Celsius) according to the following approximate (first-order) formula, 

$$ c \approx 331 + 0.610 \times t_{air}. \tag{9.7} $$

The localization problem is then to estimate $\mathbf{r}_s$ given the set of $\mathbf{r}_i$ and $\tau_{ij}$. 
Note that there are (N + 1)N/2 distinct TDOA estimates $\tau_{ij}$, which exclude 
the case i = j and count the $\tau_{ij} = -\tau_{ji}$ pair only once. However, in the 
absence of noise, the space spanned by these TDOA estimates is N-dimensional. 
Any N linearly independent TDOAs determine all of the others. In a noisy 
environment, the TDOA redundancy can be used to improve the accuracy of 
the source localization algorithms, but this would increase their computational 
complexity. For simplicity and also without loss of generality, we choose 
$\tau_{i0}$, i = 1, ..., N, as the basis for this space in this chapter. 
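
The quantities of this section are easy to evaluate numerically. The sketch below, with hypothetical microphone and source coordinates and hypothetical function names, computes the range differences and TDOAs relative to the reference microphone and illustrates the redundancy noted above: in the absence of noise any other range difference follows from the basis, for instance the range difference between microphones 2 and 1 equals the difference of their range differences to microphone 0.

```python
import math

def speed_of_sound(t_air_celsius):
    """First-order approximation (9.7), in m/s."""
    return 331.0 + 0.610 * t_air_celsius

def range_difference(mics, source, i, j):
    """d_ij = D_i - D_j as in (9.5), with D_i the source-to-microphone distance."""
    return math.dist(mics[i], source) - math.dist(mics[j], source)

# Hypothetical geometry (meters); the reference microphone sits at the origin.
mics = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5)]
source = (2.0, 1.5, 1.0)

c = speed_of_sound(25.0)
basis = [range_difference(mics, source, i, 0) for i in range(1, len(mics))]  # d_i0
tdoas = [d / c for d in basis]                                               # tau_i0

d_21_direct = range_difference(mics, source, 2, 1)
d_21_from_basis = basis[1] - basis[0]          # d_20 - d_10
```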
3. MEASUREMENT MODEL AND CRAMER-RAO 
LOWER BOUND FOR SOURCE LOCALIZATION 

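When the source localization problem is examined using estimation theory, 
the measurements of the range differences are modeled by: 

$$ d_{i0} = g_i(\mathbf{r}_s) + \epsilon_i, \quad i = 1, \ldots, N, \tag{9.8} $$

where 

$$ g_i(\mathbf{r}_s) = \|\mathbf{r}_i - \mathbf{r}_s\| - \|\mathbf{r}_s\|, $$

and the $\epsilon_i$’s are measurement errors. In a vector form, such an additive measurement 
error model becomes 

$$ \mathbf{d} = \mathbf{g}(\mathbf{r}_s) + \boldsymbol{\epsilon}, \tag{9.9} $$

where 

$$ \mathbf{d} = \left[ d_{10} \;\; d_{20} \;\; \cdots \;\; d_{N0} \right]^T, $$
$$ \mathbf{g}(\mathbf{r}_s) = \left[ g_1(\mathbf{r}_s) \;\; g_2(\mathbf{r}_s) \;\; \cdots \;\; g_N(\mathbf{r}_s) \right]^T, $$
$$ \boldsymbol{\epsilon} = \left[ \epsilon_1 \;\; \epsilon_2 \;\; \cdots \;\; \epsilon_N \right]^T. $$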



Further, we postulate that the additive measurement errors have mean zero 
and are independent of the range difference observation, as well as the source 
location $\mathbf{r}_s$. For a continuous-time estimator, the corrupting noise, as indicated 
in [13], is jointly Gaussian distributed. The probability density function (PDF) 
of $\mathbf{d}$ conditioned on $\mathbf{r}_s$ is subsequently given by 

$$ p(\mathbf{d} \,|\, \mathbf{r}_s) = \frac{\exp\left\{ -\frac{1}{2} \left[ \mathbf{d} - \mathbf{g}(\mathbf{r}_s) \right]^T \mathbf{C}_\epsilon^{-1} \left[ \mathbf{d} - \mathbf{g}(\mathbf{r}_s) \right] \right\}}{\sqrt{(2\pi)^N \det(\mathbf{C}_\epsilon)}}, \tag{9.10} $$

where $\mathbf{C}_\epsilon$ is the covariance matrix of $\boldsymbol{\epsilon}$ and “det” denotes the determinant. 
Note that $\mathbf{C}_\epsilon$ is independent of $\mathbf{r}_s$ by assumption. Since digital equipment is 
used to sample the microphone waveforms and estimate the TDOAs, the error 
introduced by discrete-time processing also has to be taken into account. 
When this is done, the measurement error is no longer Gaussian and is more 
properly modeled as a mixture of a Gaussian noise and a noise that is uniformly 
distributed over $[-T_s c/2, \, T_s c/2]$, where $T_s$ is the sampling period. As 
an example, for a digital source location estimator with an 8 kHz sampling rate 
operating at room temperature (25 degrees Celsius, i.e. c ≈ 346.25 meters per 
second), the maximum error in range difference estimates due to sampling is 
about ±2.164 cm, which leads to considerable errors in the location estimate, 
especially when the source is far from the microphone array. 
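
The figure quoted in this example is simply half a sampling period multiplied by the speed of sound. A small Python check, with a hypothetical function name and the first-order temperature formula (9.7) written out inline, reproduces it:

```python
def max_range_error_from_sampling(fs_hz, t_air_celsius):
    """Largest range-difference error caused by sampling, T_s * c / 2, in meters."""
    c = 331.0 + 0.610 * t_air_celsius     # speed of sound from (9.7), in m/s
    return c / (2.0 * fs_hz)              # T_s = 1 / fs_hz

# 8 kHz sampling at 25 degrees Celsius: about 2.164 cm
err_cm = 100.0 * max_range_error_from_sampling(8000.0, 25.0)
```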

Under the measurement model (9.9), we are now faced with the parameter 
estimation problem of extracting the source location information from the 
mismeasured range differences or the equivalent TDOAs. For an unbiased estimator, 
a Cramer-Rao lower bound (CRLB) can be placed on the variance of 
each estimated coordinate of the source location. However, since the range 
difference function $\mathbf{g}(\mathbf{r}_s)$ in the measurement model is nonlinear in the parameters 
under estimation, it is very difficult (or even impossible) to find an 
unbiased estimator that is mathematically simple and attains the CRLB. The 
CRLB is usually used as a benchmark against which the statistical efficiency 
of any unbiased estimator can be compared. 

In general, without any assumptions made about the PDF of the measurement 
error $\boldsymbol{\epsilon}$, the CRLB of the i-th (i = 1, 2, 3) parameter variance is found as the 
[i, i] element of the inverse of the Fisher information matrix defined by [25]: 

$$ \left[ \mathbf{I}(\mathbf{r}_s) \right]_{ij} = -E\left\{ \frac{\partial^2 \ln p(\mathbf{d} \,|\, \mathbf{r}_s)}{\partial r_{s,i} \, \partial r_{s,j}} \right\}, \tag{9.11} $$

where the three parameters of $\mathbf{r}_s$, i.e., $r_{s,1}$, $r_{s,2}$, and $r_{s,3}$, are respectively the x, y, 
and z coordinates of the source location. 

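In the case of a Gaussian measurement error, the Fisher information matrix 
turns into [22] 

$$ \mathbf{I}(\mathbf{r}_s) = \left[ \frac{\partial \mathbf{g}(\mathbf{r}_s)}{\partial \mathbf{r}_s} \right]^T \mathbf{C}_\epsilon^{-1} \left[ \frac{\partial \mathbf{g}(\mathbf{r}_s)}{\partial \mathbf{r}_s} \right], \tag{9.12} $$

where $\partial \mathbf{g}(\mathbf{r}_s) / \partial \mathbf{r}_s$ is an N × 3 Jacobian matrix defined as 

$$ \frac{\partial \mathbf{g}(\mathbf{r}_s)}{\partial \mathbf{r}_s} = \begin{bmatrix}
\frac{\partial g_1(\mathbf{r}_s)}{\partial x_s} & \frac{\partial g_1(\mathbf{r}_s)}{\partial y_s} & \frac{\partial g_1(\mathbf{r}_s)}{\partial z_s} \\
\frac{\partial g_2(\mathbf{r}_s)}{\partial x_s} & \frac{\partial g_2(\mathbf{r}_s)}{\partial y_s} & \frac{\partial g_2(\mathbf{r}_s)}{\partial z_s} \\
\vdots & \vdots & \vdots \\
\frac{\partial g_N(\mathbf{r}_s)}{\partial x_s} & \frac{\partial g_N(\mathbf{r}_s)}{\partial y_s} & \frac{\partial g_N(\mathbf{r}_s)}{\partial z_s}
\end{bmatrix} = \begin{bmatrix}
(\mathbf{u}_1 - \mathbf{u}_0)^T \\
(\mathbf{u}_2 - \mathbf{u}_0)^T \\
\vdots \\
(\mathbf{u}_N - \mathbf{u}_0)^T
\end{bmatrix}, \tag{9.13} $$

and 

$$ \mathbf{u}_i = \frac{\mathbf{r}_s - \mathbf{r}_i}{\|\mathbf{r}_s - \mathbf{r}_i\|} = \frac{\mathbf{r}_s - \mathbf{r}_i}{D_i}, \quad i = 0, 1, \ldots, N \tag{9.14} $$

is the normalized vector of unit length pointing from the i-th microphone to the 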
sound source. 
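
For the Gaussian case the bound can be evaluated directly from (9.12) to (9.14). The Python sketch below does so for a hypothetical five-microphone array, under the additional assumption (made here only for the example) that the errors are independent with a common variance, so that the covariance matrix is a scaled identity. The function names, the geometry, and the 1 cm error standard deviation are all illustrative choices.

```python
import math

def unit_vector(frm, to):
    """Unit-length vector pointing from `frm` to `to`."""
    diff = [b - a for a, b in zip(frm, to)]
    n = math.sqrt(sum(c * c for c in diff))
    return [c / n for c in diff]

def jacobian(mics, source):
    """Rows (u_i - u_0)^T of (9.13), i = 1..N; mics[0] is the reference."""
    u = [unit_vector(m, source) for m in mics]
    return [[u[i][c] - u[0][c] for c in range(3)] for i in range(1, len(mics))]

def fisher_information(mics, source, sigma2):
    """(9.12) with the covariance assumed equal to sigma2 times the identity."""
    G = jacobian(mics, source)
    return [[sum(row[a] * row[b] for row in G) / sigma2 for b in range(3)]
            for a in range(3)]

def inverse_3x3(a):
    """Inverse of a 3 x 3 matrix via the adjugate."""
    (a00, a01, a02), (a10, a11, a12), (a20, a21, a22) = a
    c00 = a11 * a22 - a12 * a21
    c01 = a12 * a20 - a10 * a22
    c02 = a10 * a21 - a11 * a20
    det = a00 * c00 + a01 * c01 + a02 * c02
    adj = [[c00, a02 * a21 - a01 * a22, a01 * a12 - a02 * a11],
           [c01, a00 * a22 - a02 * a20, a02 * a10 - a00 * a12],
           [c02, a01 * a20 - a00 * a21, a00 * a11 - a01 * a10]]
    return [[v / det for v in row] for row in adj]

# Hypothetical geometry (meters); the reference microphone sits at the origin.
mics = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0),
        (0.0, 0.0, 0.5), (0.5, 0.5, 0.5)]
source = (2.0, 1.5, 1.0)

G = jacobian(mics, source)
fim = fisher_information(mics, source, sigma2=0.01 ** 2)   # 1 cm error std
fim_inv = inverse_3x3(fim)
crlb = [fim_inv[a][a] for a in range(3)]   # bounds on var(x), var(y), var(z), in m^2
```

The rows of the Jacobian are differences of nearly parallel unit vectors when the source is far from a compact array, which is why the bound grows quickly with source range.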



4. MAXIMUM LIKELIHOOD ESTIMATOR 

In the previous section, the measurement model for the source localization
problem was investigated and the CRLB for any unbiased estimator was determined.
Since the measurement model is highly nonlinear, an efficient estimator
that attains the CRLB may not exist or might be impossible to find even if it
does exist. In practice, the maximum likelihood estimator (MLE) is the most popular
approach. It has the well-proven advantage of asymptotic efficiency for a large
sample space.

To apply the maximum likelihood principle, the statistical characteristics of
the measurements need to be known or properly assumed prior to any processing.
From the central limit theorem and also for mathematical simplicity, the
measurement error is usually modeled as Gaussian and the likelihood function
is given by (9.10), which is considered as a function of the source position $\mathbf{r}_s$
under estimation.

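Since the exponential function is monotonically increasing, the MLE is
equivalent to minimizing a (log-likelihood) cost function defined as

$$J_{\mathrm{MLE}}(\mathbf{r}_s) = [\mathbf{d} - \mathbf{g}(\mathbf{r}_s)]^T \mathbf{C}_\epsilon^{-1} [\mathbf{d} - \mathbf{g}(\mathbf{r}_s)]. \tag{9.15}$$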

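Direct estimation of the minimizer is generally not practical. If the noise at
different microphones is assumed to be uncorrelated, the covariance matrix is
diagonal:

$$\mathbf{C}_\epsilon = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_N^2), \tag{9.16}$$

where $\sigma_i^2$ ($i = 1, 2, \ldots, N$) is the variance of $\epsilon_i$, and the cost function (9.15)
becomes

$$J_{\mathrm{MLE}}(\mathbf{r}_s) = \sum_{i=1}^{N} \frac{[d_{i0} - g_i(\mathbf{r}_s)]^2}{\sigma_i^2}. \tag{9.17}$$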

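Among other approaches, the steepest descent algorithm can be used to find
$\hat{\mathbf{r}}_{s,\mathrm{MLE}}$ iteratively with

$$\hat{\mathbf{r}}_s(k+1) = \hat{\mathbf{r}}_s(k) - \mu \nabla J_{\mathrm{MLE}}(\hat{\mathbf{r}}_s(k)), \tag{9.18}$$

where $\mu$ is the step size.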
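A steepest descent search on the cost (9.17) might be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the step-halving safeguard is an addition that is not part of (9.18), and the function names, initial guess, and source position are hypothetical.

```python
import numpy as np

def g(mics, r):
    """Range differences g_i(r) = ||r - r_i|| - ||r - r_0||, i = 1..N."""
    dist = np.linalg.norm(r - mics, axis=1)
    return dist[1:] - dist[0]

def cost(mics, d, sigma2, r):
    """Cost function (9.17) for uncorrelated errors with variances sigma2."""
    return np.sum((d - g(mics, r)) ** 2 / sigma2)

def gradient(mics, d, sigma2, r):
    """Gradient of (9.17), using the Jacobian rows (u_i - u_0)^T of (9.13)."""
    diff = r - mics
    u = diff / np.linalg.norm(diff, axis=1, keepdims=True)
    jac = u[1:] - u[0]
    return -2.0 * jac.T @ ((d - g(mics, r)) / sigma2)

def steepest_descent(mics, d, sigma2, r0, mu=1.0, iters=200):
    r = np.asarray(r0, dtype=float)
    for _ in range(iters):
        grad = gradient(mics, d, sigma2, r)
        current = cost(mics, d, sigma2, r)
        step = mu
        # Added safeguard (not in (9.18)): halve the step until the cost decreases.
        while step > 1e-12 and cost(mics, d, sigma2, r - step * grad) >= current:
            step *= 0.5
        if step <= 1e-12:
            break                      # no decreasing step found; stop
        r = r - step * grad
    return r

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_true = np.array([100.0, 80.0, 60.0])   # hypothetical source (cm)
d = g(mics, r_true)                      # noise-free range differences
sigma2 = np.ones(5)                      # assumed unit variances
r0 = np.array([50.0, 50.0, 50.0])        # hypothetical initial guess
r_hat = steepest_descent(mics, d, sigma2, r0)
print(r_hat, cost(mics, d, sigma2, r_hat))
```

The sketch only guarantees that the cost does not increase from one iteration to the next; where the search ends still depends on the initial guess.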

The foregoing MLE can be determined and is asymptotically optimal for this
problem only if its two assumptions (Gaussian and uncorrelated measurement
noise) hold. However, this is not the case in practice, as discussed in Section
3. Furthermore, the number of microphones in an array for camera pointing or
beamformer steering is always limited, which makes source localization a
finite-sample rather than a large-sample problem. In addition, the cost function
(9.17) is generally not strictly convex. To avoid being trapped in a local minimum
by the steepest descent algorithm, we need to select a good initial guess of
the source location, which is difficult to do in practice, and convergence of the
iterative algorithm to the desired solution cannot be guaranteed.

5. LEAST SQUARES ESTIMATORS 

Two limitations of the MLE are that probabilistic assumptions have to be
made about the measured range differences and that the iterative algorithm to
find the solution is computationally intensive. An alternative method is the
well-known least squares estimator (LSE). The LSE makes no probabilistic
assumptions about the data and hence can be applied to the source localization
problem in which a precise statistical characterization of the data is hard to
determine. Furthermore, an LSE usually produces a closed-form estimate that
is desirable in real-time applications. In this section, we begin by investigating
the least squares (LS) error criteria and then develop different LS approaches
to source localization.

5.1 THE LEAST SQUARES ERROR CRITERIA 

In the LS approach, we attempt to minimize a squared error function that is
zero in the absence of noise and model inaccuracies. Different error functions
can be defined to measure how close the assumed (noiseless) signal, based on
hypothesized parameters, is to the observed data. Each error function leads to a
different LSE. For the source localization problem, two LS error criteria
can be constructed and are presented here.

5.1.1 Hyperbolic LS Error Function. The first LS error function is defined
as the difference between the observed range difference and that generated
by a signal model depending upon the unknown parameters. Such an error function
is routinely used in many LS estimators:

$$\mathbf{e}_h(\mathbf{r}_s) = \mathbf{d} - \mathbf{g}(\mathbf{r}_s), \tag{9.19}$$

and the corresponding LS criterion is given by

$$J_h = \mathbf{e}_h^T \mathbf{e}_h = [\mathbf{d} - \mathbf{g}(\mathbf{r}_s)]^T [\mathbf{d} - \mathbf{g}(\mathbf{r}_s)]. \tag{9.20}$$

In the source localization problem, an observed range difference $d_{i0}$ defines
a hyperboloid in 3D space. All points lying on such a hyperboloid are potential
source locations and all have the same range difference $d_{i0}$ to the two microphones
$i$ and 0. Therefore, a sound source that is located by minimizing the
hyperbolic LS error criterion (9.20) has the shortest distance to all hyperboloids
associated with different microphone pairs and specified by the estimated range
differences.

In (9.19), the signal model g(rs) consists of a set of hyperbolic functions. 
Since they are nonlinear, minimizing (9.20) leads to a mathematically in- 
tractable solution as N gets large. Moreover, the hyperbolic function is very 
sensitive to noise, especially for far-field sources. As a result, it is rarely used 
in practice. 

When the statistical characteristics of the corrupting noise are unknown,
uncorrelated white Gaussian noise is one reasonable assumption. In this case,
it is not surprising that the hyperbolic LSE and the MLE minimize (maximize)
similar criteria.

5.1.2 Spherical LS Error Function. The second LS criterion is based
on the errors found in the distances from a hypothesized source location to
the microphones. In the absence of measurement errors, the correct source
location lies at the intersection of a group of spheres centered at the
microphones. When measurement errors are present, the best estimate of the
source location would be the point that yields the shortest distance to those
spheres defined by the range differences and the hypothesized source range.

Consider the distance $D_i$ from the $i$-th microphone to the source. From the
definition of the range difference (9.5) and the fact that $D_0 = R_s$, we have:

$$\hat{D}_i = R_s + d_{i0}, \tag{9.21}$$

where $\hat{D}_i$ denotes an observation based on the measured range difference. From
the inner product, we can derive the true value for $D_i^2$, the square of the
noise-free distance generated by a spherical signal model:

$$D_i^2 = \|\mathbf{r}_i - \mathbf{r}_s\|^2 = R_i^2 - 2\mathbf{r}_i^T \mathbf{r}_s + R_s^2. \tag{9.22}$$



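The spherical LS error function is then defined as the difference between the
measured and hypothesized values

$$e_{sp,i}(\mathbf{r}_s) \triangleq \frac{1}{2}\left(\hat{D}_i^2 - D_i^2\right) = \mathbf{r}_i^T \mathbf{r}_s + d_{i0} R_s - \frac{1}{2}\left(R_i^2 - d_{i0}^2\right), \quad i = 1, \ldots, N. \tag{9.23}$$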



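Putting the $N$ errors together and writing them in a vector form gives

$$\mathbf{e}_{sp}(\mathbf{r}_s) = \mathbf{A}\boldsymbol{\theta} - \mathbf{b}, \tag{9.24}$$

where

$$\mathbf{A} \triangleq [\,\mathbf{S}\,|\,\mathbf{d}\,], \quad
\mathbf{S} \triangleq \begin{bmatrix} x_1 & y_1 & z_1 \\ x_2 & y_2 & z_2 \\ \vdots & \vdots & \vdots \\ x_N & y_N & z_N \end{bmatrix}, \quad
\boldsymbol{\theta} \triangleq \begin{bmatrix} x_s \\ y_s \\ z_s \\ R_s \end{bmatrix}, \quad
\mathbf{b} \triangleq \frac{1}{2} \begin{bmatrix} R_1^2 - d_{10}^2 \\ R_2^2 - d_{20}^2 \\ \vdots \\ R_N^2 - d_{N0}^2 \end{bmatrix},$$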



and $[\,\mathbf{S}\,|\,\mathbf{d}\,]$ indicates that $\mathbf{S}$ and $\mathbf{d}$ are stacked side-by-side. The corresponding
LS criterion is then given by:

$$J_{sp} = \mathbf{e}_{sp}^T \mathbf{e}_{sp} = [\mathbf{A}\boldsymbol{\theta} - \mathbf{b}]^T [\mathbf{A}\boldsymbol{\theta} - \mathbf{b}]. \tag{9.25}$$

In contrast to the hyperbolic error function (9.19), the spherical error function
(9.24) is linear in $\mathbf{r}_s$ given $R_s$ and vice versa. Therefore, the computational
complexity to find a solution will not dramatically increase as $N$ gets large.
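To make the spherical error concrete, the quantities in (9.24) can be assembled directly from the microphone coordinates and the range differences. The sketch below assumes the reference microphone sits at the origin; `spherical_system` and the source position are hypothetical names and values chosen for illustration.

```python
import numpy as np

def spherical_system(mics, d):
    """Build S, A = [S | d] and b of (9.24); mics[0] must be the reference at the origin."""
    S = mics[1:]                                   # N x 3 microphone coordinates
    A = np.column_stack([S, d])                    # N x 4
    b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)    # (R_i^2 - d_i0^2) / 2
    return S, A, b

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0]                             # noise-free range differences d_i0

S, A, b = spherical_system(mics, d)
theta = np.append(r_s, np.linalg.norm(r_s))        # [x_s, y_s, z_s, R_s]
e_sp = A @ theta - b                               # spherical error vector
print(e_sp)
```

With noise-free range differences the error vector vanishes up to rounding, which is the property the LS criterion (9.25) relies on.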



5.2 SPHERICAL INTERSECTION (SX) ESTIMATOR 

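The SX source location estimator employs the spherical error and solves the
problem in two steps [20]. First, we find the least-squares solution for $\mathbf{r}_s$ in
terms of $R_s$,

$$\hat{\mathbf{r}}_s = \mathbf{S}^\dagger (\mathbf{b} - R_s \mathbf{d}), \tag{9.26}$$

where

$$\mathbf{S}^\dagger = (\mathbf{S}^T \mathbf{S})^{-1} \mathbf{S}^T$$

is the pseudo-inverse of matrix $\mathbf{S}$. Then, substituting (9.26) into the constraint
$R_s^2 = \mathbf{r}_s^T \mathbf{r}_s$ yields a quadratic equation as follows

$$R_s^2 = \left[\mathbf{S}^\dagger (\mathbf{b} - R_s \mathbf{d})\right]^T \left[\mathbf{S}^\dagger (\mathbf{b} - R_s \mathbf{d})\right]. \tag{9.27}$$

After expansion, it becomes

$$a R_s^2 + b R_s + c = 0, \tag{9.28}$$

where

$$a = 1 - \|\mathbf{S}^\dagger \mathbf{d}\|^2, \quad b = 2\,\mathbf{b}^T \mathbf{S}^{\dagger T} \mathbf{S}^\dagger \mathbf{d}, \quad c = -\|\mathbf{S}^\dagger \mathbf{b}\|^2.$$

The valid (real, positive) root is taken as an estimate of the source range $R_s$ and
is then substituted into (9.26) to calculate the SX estimate $\hat{\mathbf{r}}_{s,\mathrm{SX}}$ of the source
location.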

In the SX estimation procedure, the solution of the quadratic equation (9.28)
for the source range $R_s$ is required, and this solution must be positive.
If a real positive root is not available, the SX solution does not exist.
Conversely, if both of the roots are real and greater than 0, then the SX solution
is not unique. In both cases, the SX source location estimator fails to produce
a reliable estimate, which is not desirable for a real-time implementation.
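The two SX steps can be sketched as below. The function `sx_estimates` is a hypothetical helper that returns one candidate location per real, positive root of (9.28), so an empty list corresponds to the case where no SX solution exists and a list of two corresponds to the non-unique case. The source position is an assumed test value.

```python
import numpy as np

def sx_estimates(mics, d):
    """Candidate SX locations, one per real, positive root of the quadratic (9.28)."""
    S = mics[1:]
    b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)
    S_pinv = np.linalg.pinv(S)                     # (S^T S)^{-1} S^T for full column rank S
    sb, sd = S_pinv @ b, S_pinv @ d
    coeffs = [1.0 - sd @ sd, 2.0 * (sb @ sd), -(sb @ sb)]   # a, b, c of (9.28)
    roots = np.roots(coeffs)
    ranges = [z.real for z in roots if abs(z.imag) < 1e-9 and z.real > 0]
    return [S_pinv @ (b - R * d) for R in ranges]  # substitute each range into (9.26)

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0]                             # noise-free range differences

candidates = sx_estimates(mics, d)
for c in candidates:
    print(c)
```

Returning every admissible root, instead of silently picking one, leaves the decision about ambiguous cases to the caller.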

5.3 SPHERICAL INTERPOLATION (SI) ESTIMATOR 

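In order to overcome the drawback of the SX algorithm, a spherical interpolation
estimator was proposed in [26] which attempts to relax the restriction
$R_s = \|\mathbf{r}_s\|$ by estimating $R_s$ in the least-squares sense.

To begin, we substitute the least-squares solution (9.26) into the original
spherical equation $\mathbf{A}\boldsymbol{\theta} = \mathbf{b}$ to obtain

$$R_s \mathbf{P}_{S^\perp} \mathbf{d} = \mathbf{P}_{S^\perp} \mathbf{b}, \tag{9.29}$$

where

$$\mathbf{P}_{S^\perp} \triangleq \mathbf{I}_{N \times N} - \mathbf{S}\mathbf{S}^\dagger, \tag{9.30}$$

and $\mathbf{I}_{N \times N}$ is an $N \times N$ identity matrix. Matrix $\mathbf{P}_{S^\perp}$ is a projection matrix that
projects a vector, when multiplied by the matrix, onto a space that is orthogonal
to the column space of $\mathbf{S}$. Such a projection matrix is symmetric (i.e., $\mathbf{P}_{S^\perp} = \mathbf{P}_{S^\perp}^T$)
and idempotent (i.e., $\mathbf{P}_{S^\perp} = \mathbf{P}_{S^\perp} \mathbf{P}_{S^\perp}$). Then the least-squares solution
to (9.29) is given by

$$\hat{R}_{s,\mathrm{SI}} = \frac{\mathbf{d}^T \mathbf{P}_{S^\perp} \mathbf{b}}{\mathbf{d}^T \mathbf{P}_{S^\perp} \mathbf{d}}. \tag{9.31}$$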



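Substituting this solution into (9.26) yields the SI estimate

$$\hat{\mathbf{r}}_{s,\mathrm{SI}} = \mathbf{S}^\dagger \left[ \mathbf{I}_{N \times N} - \frac{\mathbf{d}\mathbf{d}^T \mathbf{P}_{S^\perp}}{\mathbf{d}^T \mathbf{P}_{S^\perp} \mathbf{d}} \right] \mathbf{b}. \tag{9.32}$$

In practice, the SI estimator performs better than the SX estimator, but is
computationally slightly more complex.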
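A direct transcription of (9.30) to (9.32) is short. The sketch below uses the hypothetical name `si_estimate` and an assumed source position; it forms the $N \times N$ projection matrix explicitly, which is fine for small arrays.

```python
import numpy as np

def si_estimate(mics, d):
    """Spherical interpolation: range from (9.31), then location via (9.26)."""
    S = mics[1:]
    N = S.shape[0]
    b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)
    S_pinv = np.linalg.pinv(S)
    P = np.eye(N) - S @ S_pinv                 # projection matrix (9.30)
    R_hat = (d @ P @ b) / (d @ P @ d)          # range estimate (9.31)
    r_hat = S_pinv @ (b - R_hat * d)           # same result as (9.32)
    return r_hat, R_hat

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])            # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0]                         # noise-free range differences

r_hat, R_hat = si_estimate(mics, d)
print(r_hat, R_hat)
```

Unlike the SX sketch, there is no root selection here: the range comes from a single ratio, so the estimator always returns exactly one location as long as $\mathbf{d}^T \mathbf{P}_{S^\perp} \mathbf{d}$ is nonzero.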



5.4 LINEAR-CORRECTION LEAST SQUARES ESTIMATOR

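Finding the LSE based on the spherical error criterion (9.25) is a linear
minimization problem

$$\min_{\boldsymbol{\theta}} \; (\mathbf{A}\boldsymbol{\theta} - \mathbf{b})^T (\mathbf{A}\boldsymbol{\theta} - \mathbf{b}) \tag{9.33}$$

subject to a quadratic constraint

$$\boldsymbol{\theta}^T \boldsymbol{\Sigma} \boldsymbol{\theta} = 0, \tag{9.34}$$

where $\boldsymbol{\Sigma} \triangleq \mathrm{diag}(1, 1, 1, -1)$ is a diagonal and orthonormal matrix.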

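For such a constrained minimization problem, the technique of Lagrange
multipliers will be used and the source location is determined by minimizing
the Lagrangian

$$\mathcal{L}(\boldsymbol{\theta}, \lambda) = J_{sp} + \lambda \boldsymbol{\theta}^T \boldsymbol{\Sigma} \boldsymbol{\theta} = (\mathbf{A}\boldsymbol{\theta} - \mathbf{b})^T (\mathbf{A}\boldsymbol{\theta} - \mathbf{b}) + \lambda \boldsymbol{\theta}^T \boldsymbol{\Sigma} \boldsymbol{\theta},$$

where $\lambda$ is the Lagrange multiplier. Expanding this expression yields

$$\mathcal{L}(\boldsymbol{\theta}, \lambda) = \boldsymbol{\theta}^T (\mathbf{A}^T \mathbf{A} + \lambda \boldsymbol{\Sigma}) \boldsymbol{\theta} - 2\mathbf{b}^T \mathbf{A} \boldsymbol{\theta} + \mathbf{b}^T \mathbf{b}. \tag{9.35}$$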

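Necessary conditions for minimizing (9.35) can be obtained by taking the gradient
of $\mathcal{L}(\boldsymbol{\theta}, \lambda)$ with respect to $\boldsymbol{\theta}$ and equating the result to zero. This produces:

$$\frac{\partial \mathcal{L}(\boldsymbol{\theta}, \lambda)}{\partial \boldsymbol{\theta}} = 2\left(\mathbf{A}^T \mathbf{A} + \lambda \boldsymbol{\Sigma}\right) \boldsymbol{\theta} - 2\mathbf{A}^T \mathbf{b} = \mathbf{0}. \tag{9.36}$$

Solving for $\boldsymbol{\theta}$ yields the constrained least squares estimate

$$\hat{\boldsymbol{\theta}} = \left(\mathbf{A}^T \mathbf{A} + \lambda \boldsymbol{\Sigma}\right)^{-1} \mathbf{A}^T \mathbf{b}, \tag{9.37}$$

where $\lambda$ is yet to be determined.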

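In order to find $\lambda$, we can impose the quadratic constraint directly by
substituting (9.37) into (9.34), which leads to

$$\mathbf{b}^T \mathbf{A} \left(\mathbf{A}^T \mathbf{A} + \lambda \boldsymbol{\Sigma}\right)^{-1} \boldsymbol{\Sigma} \left(\mathbf{A}^T \mathbf{A} + \lambda \boldsymbol{\Sigma}\right)^{-1} \mathbf{A}^T \mathbf{b} = 0. \tag{9.38}$$

With eigenvalue analysis, the matrix $\mathbf{A}^T \mathbf{A} \boldsymbol{\Sigma}$ can be decomposed as

$$\mathbf{A}^T \mathbf{A} \boldsymbol{\Sigma} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^{-1}, \tag{9.39}$$

where $\boldsymbol{\Lambda} = \mathrm{diag}(\gamma_1, \ldots, \gamma_4)$ and $\gamma_i$, $i = 1, \ldots, 4$, are the eigenvalues of the
matrix $\mathbf{A}^T \mathbf{A} \boldsymbol{\Sigma}$. Substituting (9.39) into (9.38), we may rewrite the constraint
as:

$$\mathbf{p}^T (\boldsymbol{\Lambda} + \lambda \mathbf{I})^{-2} \mathbf{q} = 0, \tag{9.40}$$

where

$$\mathbf{p} = \mathbf{U}^T \boldsymbol{\Sigma} \mathbf{A}^T \mathbf{b}, \quad \mathbf{q} = \mathbf{U}^{-1} \mathbf{A}^T \mathbf{b}.$$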



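Define a function of the Lagrange multiplier as follows

$$f(\lambda) \triangleq \mathbf{p}^T (\boldsymbol{\Lambda} + \lambda \mathbf{I})^{-2} \mathbf{q} = \sum_{i=1}^{4} \frac{p_i q_i}{(\lambda + \gamma_i)^2}. \tag{9.41}$$

Clearing the denominators of $f(\lambda) = 0$ gives a polynomial equation of degree six
in $\lambda$, and because of its complexity numerical methods need to be used for root
searching. Since the root of (9.41) for $\lambda$ is not unique, a two-step procedure
will be followed such that the desired source location could be found.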
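As a numerical cross-check of (9.38) to (9.41), the eigenvalue form of $f(\lambda)$ can be compared with the direct evaluation of the constraint at the estimate (9.37). The sketch below is illustrative only: `make_f` and `f_direct` are hypothetical names, and the source position and the small fixed perturbation added to the range differences are assumed test values.

```python
import numpy as np

SIGMA = np.diag([1.0, 1.0, 1.0, -1.0])

def make_f(A, b):
    """Return f(lambda) of (9.41), built from the eigen-decomposition (9.39)."""
    gammas, U = np.linalg.eig(A.T @ A @ SIGMA)
    p = U.T @ SIGMA @ A.T @ b                  # plain transpose, as in the definition of p
    q = np.linalg.solve(U, A.T @ b)            # U^{-1} A^T b
    return lambda lam: float(np.real(np.sum(p * q / (lam + gammas) ** 2)))

def f_direct(A, b, lam):
    """theta^T Sigma theta with theta from (9.37); this is the left side of (9.38)."""
    theta = np.linalg.solve(A.T @ A + lam * SIGMA, A.T @ b)
    return float(theta @ SIGMA @ theta)

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                     # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0] + np.array([0.5, -0.3, 0.2, -0.4, 0.1])   # assumed perturbation (cm)

S = mics[1:]
A = np.column_stack([S, d])
b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)

f = make_f(A, b)
for lam in (0.0, 0.5, -0.5):
    print(lam, f(lam), f_direct(A, b, lam))
```

The eigenvalue form is cheaper when $f$ must be evaluated many times during root searching, since the decomposition is computed once.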



5.4.1 Unconstrained Spherical Least Squares Estimator. In the first
step, we assume that $x_s$, $y_s$, $z_s$, and $R_s$ are mutually independent or, equivalently,
disregard the quadratic constraint (9.34) on purpose. Then the LS solution
minimizing (9.25) for $\boldsymbol{\theta}$ (the source location as well as its range) is given by

$$\hat{\boldsymbol{\theta}}_1 = \mathbf{A}^\dagger \mathbf{b}, \tag{9.42}$$

where

$$\mathbf{A}^\dagger = \left(\mathbf{A}^T \mathbf{A}\right)^{-1} \mathbf{A}^T$$

is the pseudo-inverse of the matrix $\mathbf{A}$.

A good parameter estimator first and foremost needs to be unbiased. For 
such an unconstrained spherical least squares estimator, the bias and covari- 
ance matrix can be approximated by using the following perturbation analysis 
method. 

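When measurement errors are present in the range differences, $\mathbf{A}$, $\mathbf{b}$, and the
parameter estimate $\hat{\boldsymbol{\theta}}_1$ deviate from their true values and can be expressed as:

$$\mathbf{A} = \mathbf{A}^t + \Delta\mathbf{A}, \quad \mathbf{b} = \mathbf{b}^t + \Delta\mathbf{b}, \quad \hat{\boldsymbol{\theta}}_1 = \boldsymbol{\theta}^t + \Delta\boldsymbol{\theta}, \tag{9.43}$$

where variables with superscript $t$ denote the true values which also satisfy

$$\boldsymbol{\theta}^t = \mathbf{A}^{t\dagger} \mathbf{b}^t. \tag{9.44}$$

If the magnitudes of the perturbations are small, the second-order errors are
insignificant compared to their first-order counterparts and therefore can be
neglected for simplicity, which then yields:

$$\Delta\mathbf{A} = [\,\mathbf{0}\,|\,\boldsymbol{\epsilon}\,], \quad \Delta\mathbf{b} \approx -\mathbf{d}^t \odot \boldsymbol{\epsilon}, \tag{9.45}$$

where $\odot$ denotes the Schur (element-by-element) product. Substituting (9.43)
into (9.42) gives

$$\left(\mathbf{A}^t + \Delta\mathbf{A}\right)^T \left(\mathbf{A}^t + \Delta\mathbf{A}\right) \left(\boldsymbol{\theta}^t + \Delta\boldsymbol{\theta}\right) = \left(\mathbf{A}^t + \Delta\mathbf{A}\right)^T \left(\mathbf{b}^t + \Delta\mathbf{b}\right). \tag{9.46}$$

Retaining only the linear perturbation terms and using (9.44) and (9.45) produces:

$$\Delta\boldsymbol{\theta} \approx -\mathbf{A}^{t\dagger} (\mathbf{D}\boldsymbol{\epsilon}), \tag{9.47}$$

where

$$\mathbf{D} \triangleq \mathrm{diag}(D_1, D_2, \ldots, D_N)$$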

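is a diagonal matrix. Since the measurement error $\boldsymbol{\epsilon}$ in the range differences has
zero mean, $\hat{\boldsymbol{\theta}}_1$ is an unbiased estimate of $\boldsymbol{\theta}^t$ when the small error assumption
holds:

$$E\{\Delta\boldsymbol{\theta}\} \approx E\{-\mathbf{A}^{t\dagger} \mathbf{D} \boldsymbol{\epsilon}\} = \mathbf{0}_{4 \times 1}. \tag{9.48}$$

The covariance matrix of $\Delta\boldsymbol{\theta}$ is then found as

$$\mathbf{C}_{\Delta\theta} = E\{\Delta\boldsymbol{\theta} \Delta\boldsymbol{\theta}^T\} = \mathbf{A}^{t\dagger} \mathbf{D} \mathbf{C}_\epsilon \mathbf{D} \mathbf{A}^{t\dagger T}, \tag{9.49}$$

where $\mathbf{C}_\epsilon$ is known or is properly assumed a priori. Theoretically, the covariance
matrix $\mathbf{C}_{\Delta\theta}$ cannot be calculated since it contains true values. Nevertheless,
it can be approximated by using the values in $\hat{\boldsymbol{\theta}}_1$ with sufficient accuracy,
as suggested by our numerical studies.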
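The approximation mentioned above, in which estimated values stand in for the true ones in (9.49), might look like the following sketch. The name `usls_covariance` is hypothetical, and computing the distances $D_i$ from the estimated location (rather than from the estimated range plus the range differences) is one possible choice made here for illustration.

```python
import numpy as np

def usls_covariance(mics, d, C_eps):
    """First-order covariance (9.49) of the unconstrained estimate, evaluated at theta_1."""
    S = mics[1:]
    A = np.column_stack([S, d])
    b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)
    A_pinv = np.linalg.pinv(A)
    theta1 = A_pinv @ b                                     # unconstrained estimate (9.42)
    D = np.diag(np.linalg.norm(theta1[:3] - S, axis=1))     # estimated distances D_i (assumed choice)
    return theta1, A_pinv @ D @ C_eps @ D @ A_pinv.T

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                         # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0]

theta1, C_theta = usls_covariance(mics, d, np.eye(5))       # assumed unit error covariance
print(np.sqrt(np.diag(C_theta)))                            # approximate std of x, y, z, R (cm)
```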
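In the first unconstrained spherical LS estimate (9.42), the range information
is redundant because of the independence assumption on the source location
and range. If that information is simply discarded, the source location estimate
is the same as the SI estimate but with less computational complexity [27]. To
demonstrate this, we first write (9.42) into a block form as

$$\hat{\boldsymbol{\theta}}_1 = \begin{bmatrix} \mathbf{S}^T\mathbf{S} & \mathbf{S}^T\mathbf{d} \\ \mathbf{d}^T\mathbf{S} & \mathbf{d}^T\mathbf{d} \end{bmatrix}^{-1} \begin{bmatrix} \mathbf{S}^T \\ \mathbf{d}^T \end{bmatrix} \mathbf{b}. \tag{9.50}$$

It can easily be shown that:

$$\begin{bmatrix} \mathbf{S}^T\mathbf{S} & \mathbf{S}^T\mathbf{d} \\ \mathbf{d}^T\mathbf{S} & \mathbf{d}^T\mathbf{d} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{Q} & \mathbf{v} \\ \mathbf{v}^T & k \end{bmatrix}, \tag{9.51}$$

where

$$\mathbf{v} = -\left(\mathbf{S}^T\mathbf{S} - \frac{\mathbf{S}^T\mathbf{d}\mathbf{d}^T\mathbf{S}}{\mathbf{d}^T\mathbf{d}}\right)^{-1} \frac{\mathbf{S}^T\mathbf{d}}{\mathbf{d}^T\mathbf{d}}, \quad
\mathbf{Q} = \left(\mathbf{S}^T\mathbf{S}\right)^{-1} \left[\mathbf{I} - (\mathbf{S}^T\mathbf{d})\mathbf{v}^T\right], \quad
k = \frac{1 - (\mathbf{d}^T\mathbf{S})\mathbf{v}}{\mathbf{d}^T\mathbf{d}}.$$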



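Next, we define another projection matrix $\mathbf{P}_{d^\perp}$ associated with the $\mathbf{d}$-orthogonal
space:

$$\mathbf{P}_{d^\perp} \triangleq \mathbf{I}_{N \times N} - \frac{\mathbf{d}\mathbf{d}^T}{\mathbf{d}^T\mathbf{d}}, \tag{9.52}$$

and find

$$\mathbf{v} = -\left(\mathbf{S}^T \mathbf{P}_{d^\perp} \mathbf{S}\right)^{-1} \frac{\mathbf{S}^T\mathbf{d}}{\mathbf{d}^T\mathbf{d}}, \tag{9.53}$$

$$\mathbf{Q} = \left(\mathbf{S}^T \mathbf{P}_{d^\perp} \mathbf{S}\right)^{-1}. \tag{9.54}$$

Substituting (9.51) together with (9.53) and (9.54) into (9.50) yields the
unconstrained spherical LS estimate for source coordinates,

$$\hat{\mathbf{r}}_{s,1} = \left(\mathbf{S}^T \mathbf{P}_{d^\perp} \mathbf{S}\right)^{-1} \mathbf{S}^T \mathbf{P}_{d^\perp} \mathbf{b}, \tag{9.55}$$

which is the minimizer of

$$J_1(\mathbf{r}_s) = \left\|\mathbf{P}_{d^\perp}\mathbf{b} - \mathbf{P}_{d^\perp}\mathbf{S}\mathbf{r}_s\right\|^2, \tag{9.56}$$

or the least-squares solution to the linear equation

$$\mathbf{P}_{d^\perp}\mathbf{S}\mathbf{r}_s = \mathbf{P}_{d^\perp}\mathbf{b}. \tag{9.57}$$

In fact, the first unconstrained spherical LS estimator tries to approximate the
projection of the observation vector $\mathbf{b}$ with the projections of the column vectors
of the microphone location matrix $\mathbf{S}$ onto the $\mathbf{d}$-orthogonal space. The source
location estimate is the coefficient vector associated with the best approximation.
Clearly from (9.57), this estimation procedure is the generalization of the
plane intersection (PI) method proposed in [19].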

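By using the Sherman-Morrison formula [28]

$$\left(\mathbf{A} + \mathbf{x}\mathbf{y}^T\right)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1}\mathbf{x}\mathbf{y}^T\mathbf{A}^{-1}}{1 + \mathbf{y}^T\mathbf{A}^{-1}\mathbf{x}},$$

we can expand the item in (9.55) as

$$\left(\mathbf{S}^T \mathbf{P}_{d^\perp} \mathbf{S}\right)^{-1} = \left[\mathbf{S}^T\mathbf{S} - \frac{(\mathbf{S}^T\mathbf{d})(\mathbf{S}^T\mathbf{d})^T}{\mathbf{d}^T\mathbf{d}}\right]^{-1} = \left(\mathbf{S}^T\mathbf{S}\right)^{-1} + \frac{\mathbf{S}^\dagger\mathbf{d}\,\mathbf{d}^T\mathbf{S}^{\dagger T}}{\mathbf{d}^T \mathbf{P}_{S^\perp} \mathbf{d}}, \tag{9.58}$$

and finally can show that the unconstrained spherical LS estimate (9.55) is
equivalent to the SI estimate (9.32), i.e., $\hat{\mathbf{r}}_{s,1} \equiv \hat{\mathbf{r}}_{s,\mathrm{SI}}$.

Although the unconstrained spherical LS and the SI estimators are mathematically
equivalent, they are quite different in efficiency due to different
approaches to the source localization problem. The complexities of the SI and
unconstrained spherical LS estimators are $O(N^3)$ and $O(N)$, respectively.
In comparison, the unconstrained spherical LS estimator reduces the complexity
of the SI estimator by a factor of $N^2$, which is significant when $N$ is large
(more microphones are used).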
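The equivalence between (9.55) and (9.32) is easy to confirm numerically. In the sketch below, `usls_location` applies the $\mathbf{d}$-orthogonal projection without ever forming an $N \times N$ matrix, while `si_location` follows the SI formula; both names, the source position, and the fixed perturbation of the range differences are assumptions made for the illustration.

```python
import numpy as np

def usls_location(S, d, b):
    """Unconstrained spherical LS location (9.55) using the d-orthogonal projection."""
    Pd_S = S - np.outer(d, d @ S) / (d @ d)    # P_d S, computed without the N x N matrix
    Pd_b = b - d * (d @ b) / (d @ d)           # P_d b
    return np.linalg.solve(S.T @ Pd_S, S.T @ Pd_b)

def si_location(S, d, b):
    """SI location (9.32) via the N x N projection onto the complement of col(S)."""
    S_pinv = np.linalg.pinv(S)
    P = np.eye(len(d)) - S @ S_pinv
    return S_pinv @ (b - d * (d @ P @ b) / (d @ P @ d))

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                     # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d = dist[1:] - dist[0] + np.array([0.5, -0.3, 0.2, -0.4, 0.1])   # assumed perturbation (cm)

S = mics[1:]
b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)

r_usls = usls_location(S, d, b)
r_si = si_location(S, d, b)
theta1 = np.linalg.lstsq(np.column_stack([S, d]), b, rcond=None)[0]   # (9.42)
print(r_usls, r_si, theta1[:3])
```

All three computations give the same location; they differ only in how much work is done on $N \times N$ quantities.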



5.4.2 Linear Correction. In the previous subsection, we developed the
unconstrained spherical LS estimator (USLSE) for source localization and
demonstrated that it is mathematically equivalent to the SI estimator but with
less computational complexity. Although the USLSE/SI estimates can be accurate
as indicated in [27] among others, it is helpful to exploit the redundancy of
source range for improving the statistical efficiency (i.e., to reduce the variance
of source location estimates) of the overall estimation procedure. Therefore,
in the second step, we intend to correct the USLS estimate $\hat{\boldsymbol{\theta}}_1$ to make a better
estimate $\hat{\boldsymbol{\theta}}_2$ of $\boldsymbol{\theta}$. This new estimate should be in the neighborhood of $\hat{\boldsymbol{\theta}}_1$ and
should obey the constraint (9.34). We expect that the corrected estimate would
still be unbiased and would have a smaller variance.

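To begin, we substitute $\hat{\boldsymbol{\theta}}_1 = \boldsymbol{\theta}^t + \Delta\boldsymbol{\theta}$ into (9.36) and expand the expression
to find

$$\mathbf{A}^T\mathbf{A}\hat{\boldsymbol{\theta}}_1 + \lambda\boldsymbol{\Sigma}\hat{\boldsymbol{\theta}}_1 - \left(\mathbf{A}^T\mathbf{A} + \lambda\boldsymbol{\Sigma}\right)\Delta\boldsymbol{\theta} = \mathbf{A}^T\mathbf{b}. \tag{9.59}$$

Combined with (9.42), (9.59) becomes

$$\left(\mathbf{A}^T\mathbf{A} + \lambda\boldsymbol{\Sigma}\right)\Delta\boldsymbol{\theta} = \lambda\boldsymbol{\Sigma}\hat{\boldsymbol{\theta}}_1, \tag{9.60}$$

and hence

$$\Delta\boldsymbol{\theta} = \lambda\left(\mathbf{A}^T\mathbf{A}\right)^{-1}\boldsymbol{\Sigma}\boldsymbol{\theta}^t. \tag{9.61}$$

Substituting (9.61) into $\hat{\boldsymbol{\theta}}_1 = \boldsymbol{\theta}^t + \Delta\boldsymbol{\theta}$ yields

$$\hat{\boldsymbol{\theta}}_1 = \left[\mathbf{I} + \lambda\left(\mathbf{A}^T\mathbf{A}\right)^{-1}\boldsymbol{\Sigma}\right]\boldsymbol{\theta}^t. \tag{9.62}$$

Solving for $\boldsymbol{\theta}^t$ produces the corrected estimate $\hat{\boldsymbol{\theta}}_2$ and also the final output of
the linear correction least squares (LCLS) estimator:

$$\hat{\boldsymbol{\theta}}_2 = \left[\mathbf{I} + \lambda\left(\mathbf{A}^T\mathbf{A}\right)^{-1}\boldsymbol{\Sigma}\right]^{-1}\hat{\boldsymbol{\theta}}_1. \tag{9.63}$$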

Equation (9.63) suggests how the second-step processing updates the source
location estimate based on the first unconstrained spherical least squares result,
or equivalently the SI estimate. If the regularity condition [29]

$$\lim_{n \to \infty} \left(\lambda\left(\mathbf{A}^T\mathbf{A}\right)^{-1}\boldsymbol{\Sigma}\right)^n = \mathbf{0} \tag{9.64}$$

is satisfied, then the estimate $\hat{\boldsymbol{\theta}}_2$ can be expanded in a Neumann series:

$$\hat{\boldsymbol{\theta}}_2 = \left[\mathbf{I} + \left(-\lambda(\mathbf{A}^T\mathbf{A})^{-1}\boldsymbol{\Sigma}\right) + \left(-\lambda(\mathbf{A}^T\mathbf{A})^{-1}\boldsymbol{\Sigma}\right)^2 + \cdots\right]\hat{\boldsymbol{\theta}}_1
= \hat{\boldsymbol{\theta}}_1 + \sum_{n=1}^{\infty}\left[-\lambda(\mathbf{A}^T\mathbf{A})^{-1}\boldsymbol{\Sigma}\right]^n \hat{\boldsymbol{\theta}}_1, \tag{9.65}$$

where the second term is the linear correction. Equation (9.64) implies that
in order to avoid divergence, the Lagrange multiplier $\lambda$ should be small. In
addition, $\lambda$ needs to be determined carefully such that $\hat{\boldsymbol{\theta}}_2$ obeys the quadratic
constraint (9.34).

Because the function $f(\lambda)$ is smooth near $\lambda = 0$ (corresponding to the
neighborhood of $\hat{\boldsymbol{\theta}}_1$), as suggested by numerical experiments, the secant method
[30] can be used to determine its desired root. Two reasonable initial points can
be chosen as:

$$\lambda_0 = 0, \quad \lambda_1 = \beta, \tag{9.66}$$

where the small number $\beta$ is dependent on the array geometry. Five iterations
should be sufficient to give an accurate approximation to the root.

The idea of exploiting the relationship between a sound source's range and its
location coordinates to improve the estimation efficiency of the SI estimator was
first suggested by Chan and Ho in [22] with a quadratic correction. Accordingly,
they constructed a quadratic data model for $\hat{\boldsymbol{\theta}}_1$:

$$\hat{\boldsymbol{\theta}}_1 \odot \hat{\boldsymbol{\theta}}_1 = \mathbf{T}(\mathbf{r}_s \odot \mathbf{r}_s) + \mathbf{n}, \tag{9.67}$$

where $\odot$ denotes the Schur (element-by-element) product,

$$\mathbf{T} \triangleq \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

is a constant matrix, and $\mathbf{n}$ is the corrupting noise. In contrast to the linear
correction technique based on the Lagrange multiplier, the quadratic counterpart
needs to know the covariance matrix $\mathbf{C}_\epsilon$ of measurement errors in the range
differences a priori. In a real-time digital source localization system, a poorly
estimated $\mathbf{C}_\epsilon$ will lead to performance degradation. In addition, the
quadratic-correction least squares estimation procedure uses the perturbation approaches
to linearly approximate $\Delta\boldsymbol{\theta}$ and $\mathbf{n}$ in (9.43) and (9.67), respectively. Therefore,
the approximations of their corresponding covariance matrices $\mathbf{C}_{\Delta\theta}$ and $\mathbf{C}_n$
can be good only when the noise level is low. When noise is at a practically
high level, the quadratic-correction least squares estimate has a large bias and
a high variance. Furthermore, since the true value of the source location, which
is necessary for calculating $\mathbf{C}_{\Delta\theta}$ and $\mathbf{C}_n$, cannot be known theoretically, the
estimated source location has to be utilized for approximation. It was suggested
in [22] that several iterations in the second correction stage would improve
estimation accuracy. However, while the bias is suppressed after iterations, the
estimate is closer to the SI solution and the variance is boosted, as demonstrated
in [24]. Finally, the direct solutions of the quadratic-correction least-squares
estimator are the squares of the source location coordinates $(\hat{\mathbf{r}}_s \odot \hat{\mathbf{r}}_s)$. In 3-D
space, these correspond to 8 positions, which introduce decision ambiguities.
Other physical criteria, such as the domain of interest, were suggested but these
are hard to define in practical situations, particularly when one of the source
coordinates is close to zero.

In comparison, the linear-correction method updates the source location estimate
of the first unconstrained spherical LS estimator without making any
assumption about the error covariance matrix and without resorting to a linear
approximation. Even though we need to find a small root of function (9.41)
for the Lagrange multiplier $\lambda$ that satisfies the regularity condition (9.64), the
function $f(\lambda)$ is smooth around zero and the solution can be easily determined
using the secant method. The linear-correction method thus achieves a
better balance between computational complexity and estimation accuracy.
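The two-step procedure of this section might be sketched as follows. This is a minimal illustration, not the authors' real-time code: the function name `lcls`, the source position, and the perturbation are assumptions, and keeping the secant iterate with the smallest constraint violation is an added safeguard that the text does not prescribe.

```python
import numpy as np

SIGMA = np.diag([1.0, 1.0, 1.0, -1.0])

def lcls(mics, d, beta=1.0, iters=5):
    """Unconstrained LS (9.42) followed by the linear correction (9.63)."""
    S = mics[1:]
    A = np.column_stack([S, d])
    b = 0.5 * (np.sum(S ** 2, axis=1) - d ** 2)
    AtA, Atb = A.T @ A, A.T @ b
    theta1 = np.linalg.solve(AtA, Atb)                   # step 1: (9.42)

    def corrected(lam):
        # (9.63): [I + lam (A^T A)^{-1} Sigma]^{-1} theta1
        return np.linalg.solve(np.eye(4) + lam * np.linalg.solve(AtA, SIGMA), theta1)

    def f(lam):
        theta = corrected(lam)
        return theta @ SIGMA @ theta                     # constraint (9.34) at the corrected estimate

    # Secant iterations from the initial points (9.66).
    x0, x1 = 0.0, beta
    f0, f1 = f(x0), f(x1)
    best = min((abs(f0), x0), (abs(f1), x1))             # added safeguard: remember the best iterate
    for _ in range(iters):
        if f1 == f0:
            break                                        # secant step undefined
        x0, x1, f0 = x1, x1 - f1 * (x1 - x0) / (f1 - f0), f1
        f1 = f(x1)
        best = min(best, (abs(f1), x1))
    lam = best[1]
    return corrected(lam), theta1, lam

mics = np.array([[0, 0, 0], [40, 0, 0], [-40, 0, 0],
                 [0, 0, 40], [0, -40, 0], [0, 0, -40]], dtype=float)
r_s = np.array([100.0, 80.0, 60.0])                      # hypothetical source (cm)
dist = np.linalg.norm(r_s - mics, axis=1)
d_true = dist[1:] - dist[0]
d_noisy = d_true + np.array([0.5, -0.3, 0.2, -0.4, 0.1]) # assumed perturbation (cm)

theta2, theta1, lam = lcls(mics, d_noisy)
print(theta1[:3], theta2[:3], lam)
```

The corrected estimate is evaluated through linear solves rather than the Neumann series (9.65), so the sketch does not depend on the series converging.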



6. EXAMPLE SYSTEM IMPLEMENTATION 

Acoustic source localization systems are not necessarily complicated and
need not use computationally powerful and consequently expensive devices for
running in real time, as the implementation described briefly in this section
demonstrates. The real-time acoustic source localization system with passive
microphone arrays for video camera steering in teleconferencing environments
was developed by the authors at Bell Laboratories. Figure 9.2 shows a
signal-flow diagram of the system.

This system is based on a personal computer that is powered by an Intel
Pentium® III 500 MHz general-purpose processor and that runs a Microsoft
Windows® operating system. A Sonorus AD/24 converter and a STUDI/O® digital
audio interface card are employed to simultaneously capture multiple
microphone signals. The camera is a Sony EVI-D30 with pan, tilt, and zoom
capabilities. These motions can be performed simultaneously
by separate motors, providing good coverage of a normal conference room.
The host computer drives the camera via two layers of protocols, namely the
RS232C serial control protocol and the Video System Control Architecture
(VISCA®) protocol. The focus of the video camera is updated four times a
second and the video stream is fed into the computer through a video capture
card at a rate of 30 frames per second.

The microphone array uses six Lucent Speech Tracker Directional® hypercardioid
microphones, as illustrated in Fig. 9.3. The frequency response
of these microphones is 200-6000 Hz and beyond 4 kHz there is negligible
energy. Therefore, microphone signals are sampled at 8 kHz and a one-stage
pre-amplifier with a fixed gain of 37 dB is used prior to sampling. The reference
microphone 0 is located at the center (the origin of the coordinate system) and the
remaining microphones are each 40 cm from the reference.

The system incorporates the adaptive eigenvalue decomposition algorithm
for time delay estimation and the linear-correction least-squares algorithm for
source localization. For comparison, several other time delay estimation and
source localization approaches, investigated in the previous and
current chapters respectively, have also been implemented. Subjective tests show that
the system is successful and the new acoustic source localization
technique is more robust to room reverberation and noise than earlier developed
techniques.

7. SOURCE LOCALIZATION EXAMPLES 

The empirical bias and standard deviation data in Figs. 9.4 and 9.5 show
the results of two source localization examples using four different estimators
developed in this chapter (a more comprehensive numerical study can be found
in [24]). In the graphs of standard deviation, the CRLBs are also plotted. For
the QCLS algorithm, the true value of the source location needs to be known to
calculate the covariance matrix of the first-stage SI estimate. But this knowledge
is practically inaccessible and the estimated source location has to be used for
approximation. It is suggested in [22] that several iterations in the second
correction stage could improve the estimation accuracy. In the following, we
refer to the one without iterations as the QCLS-i estimator and the other with
iterations as the QCLS-ii estimator.

Figure 9.2 Illustration of the real-time acoustic source localization system for video camera
steering.

Figure 9.3 Microphone array of the acoustic source localization system for video camera
steering.

The microphone array designed for the real-time system presented above
was used in these examples. As illustrated in Fig. 9.3, the six microphones are
located at (distances in centimeters):

$$\mathbf{r}_0 = (0, 0, 0), \quad \mathbf{r}_1 = (40, 0, 0), \quad \mathbf{r}_2 = (-40, 0, 0),$$
$$\mathbf{r}_3 = (0, 0, 40), \quad \mathbf{r}_4 = (0, -40, 0), \quad \mathbf{r}_5 = (0, 0, -40).$$

For such an array, the value of $\beta$ in (9.66) was empirically set as $\beta = 1$. The
source was positioned 300 cm away from the array with a fixed azimuth angle
$\theta_s = 45°$ and varying elevation angles $\gamma_s$. At each location, the empirical bias
and standard deviation of each estimator were obtained by averaging the results
of 2000-trial Monte-Carlo runs.

In the first example, errors in time delay estimates are i.i.d. Gaussian with zero
mean and 1 cm standard deviation. As seen clearly from Fig. 9.4, the QCLS-i
estimator has the largest bias. Performing several iterations in the second
stage can effectively reduce the estimation bias, but the solution is more like
an SI estimate and the variance is boosted. In terms of standard deviation, all
correction estimators perform better than the SI estimator (without correction).
Among these four studied LS estimators, the QCLS-ii and the LCLS achieve the
lowest standard deviation and their values approach the CRLB at most source
locations.

In the second example, measurement errors are mutually dependent and their
covariance matrix is given by [22]:

$$\mathbf{C}_\epsilon = \frac{\sigma_\epsilon^2}{2} \begin{bmatrix} 2 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 2 \end{bmatrix}, \tag{9.69}$$

where again $\sigma_\epsilon = 1$ cm. For a more realistic simulation, all estimators are
provided with no information about the error distribution. From Fig. 9.5, we see
that the performance of each estimator deteriorates because errors are no longer
independent. At such a noise level, the linear approximation used by the QCLS
estimators is inaccurate and the estimation procedure fails. However, the LCLS
estimation procedure makes no assumption about $\mathbf{C}_\epsilon$ and does not depend on
a linear approximation. It produces an estimate whose bias and variance are
always the smallest.
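For readers who wish to set up a comparable Monte-Carlo experiment, correlated errors and the empirical statistics can be generated along the following lines. This is only a sketch: `colored_errors` and `bias_and_std` are hypothetical helpers, the random seed is arbitrary, and the covariance is scaled (as an assumption) so that every error has standard deviation $\sigma$ with off-diagonal entries of $\sigma^2/2$.

```python
import numpy as np

def colored_errors(n_trials, n_pairs, sigma=1.0, seed=0):
    """Draw zero-mean Gaussian range-difference errors that are mutually correlated."""
    # Assumed scaling: diagonal sigma^2, off-diagonal sigma^2 / 2.
    C = 0.5 * sigma ** 2 * (np.eye(n_pairs) + np.ones((n_pairs, n_pairs)))
    L = np.linalg.cholesky(C)                         # C = L L^T
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_trials, n_pairs)) @ L.T, C

def bias_and_std(estimates, truth):
    """Empirical bias and standard deviation per coordinate over the trials."""
    estimates = np.asarray(estimates, dtype=float)
    return estimates.mean(axis=0) - truth, estimates.std(axis=0)

errors, C = colored_errors(2000, 5)                   # 2000 trials, 5 microphone pairs
print(np.cov(errors, rowvar=False).round(2))          # sample covariance, for comparison with C
```

Each row of `errors` would be added to the noise-free range differences for one trial, and the location estimates collected over all trials would then be passed to `bias_and_std`.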



Figure 9.4 Comparisons of empirical bias and standard deviation among the SI, QCLS-i,
QCLS-ii, and LCLS estimators with zero mean i.i.d. Gaussian errors of standard deviation
$\sigma_\epsilon = 1$ cm. (a) Estimators of $x_s$, (b) estimators of $y_s$, (c) estimators of $z_s$.

Figure 9.5 Comparisons of empirical bias and standard deviation among the SI, QCLS-i,
QCLS-ii, and LCLS estimators with zero mean colored Gaussian errors of standard deviation
$\sigma_\epsilon = 1$ cm. (a) Estimators of $x_s$, (b) estimators of $y_s$, (c) estimators of $z_s$.

8. CONCLUSIONS

In this chapter, we have been on a short journey through the fundamental
concepts, several cutting-edge estimation algorithms, and some direct applications
of acoustic source localization with passive microphone arrays. The localization
problem was postulated from the perspective of estimation theory and the
Cramér-Rao lower bound for unbiased location estimators was derived. After
a review of conventional approaches ranging from maximum likelihood
to least squares estimators, we presented a recently developed
linear-correction least-squares algorithm that is more robust to measurement errors
and that is more computationally as well as statistically efficient. Even though
very few successful real-time acoustic source localization systems have been
developed, to say that implementing such a real-time system needs fast and
expensive processors is to drastically overstate its complexity. We presented our
acoustic source localization system for video camera steering in teleconferencing,
which used only an inexpensive Intel Pentium III general-purpose
processor. Acoustic source localization techniques will play a significant role in
next-generation multimedia communication systems, and a reliable solution
will enable the use of many modern array signal processing technologies, most
of which assume knowledge of the source locations.

References 

[1] S. Haykin, “Radar array processing for angle of arrival estimation,” in Array Signal Pro- 
cessing, S. Haykin, Ed., Englewood Cliffs, NJ: Prentice-Hall, 1985. 

[2] H. Krim and M. Viberg, “Two decades of array signal processing research: the parametric 
approach,” IEEE Signal Processing Magazine, vol. 13, no. 4, pp. 67-94, July 1996. 

[3] R. J. Vaccaro, “The past, present, and future of underwater acoustic signal processing,”
IEEE Signal Processing Magazine, vol. 15, pp. 21-51, July 1998.

[4] D. V. Sidorovich and A. B. Gershman, “Two-dimensional wideband interpolated root- 
MUSIC applied to measured seismic data,” IEEE Trans. Signal Processing, vol. 46, no. 
8, pp. 2263-2267, Aug. 1998. 

[5] Y. Huang, J. Benesty, and G. W. Elko, “Microphone arrays for video camera steering,” 
in Acoustic Signal Processing for Telecommunication, S. L. Gay and J. Benesty, Eds., 
Boston, MA: Kluwer Academic, 2000. 

[6] H. Wang and P. Chu, “Voice source localization for automatic camera pointing system 
in videoconferencing,” in Proc. IEEE ASSP Workshop Appls. Signal Processing Audio 
Acoustics, 1997. 

[7] D. V. Rabinkin, R. J. Ranomeron, J. C. French, and J. L. Flanagan, “A DSP implementation 
of source location using microphone arrays,” in Proc. SPIE, vol. 2846, pp. 88-99, 1996. 

[8] C. Wang and M. S. Brandstein, “A hybrid real-time face tracking system,” in Proc. IEEE 
ICASSP, 1998, vol. 6, pp. 3737-3741. 

[9] D. R. Fischell and C. H. Coker, “A speech direction finder,” in Proc. IEEE ICASSP, 1984, 
pp. 19.8.1-19.8.4. 

[10] H. F. Silverman, “Some analysis of microphone arrays for speech data analysis,” IEEE
Trans. Acoust., Speech, Signal Processing, vol. 35, pp. 1699-1712, Dec. 1987.

[11] J. L. Flanagan, A. Surendran, and E. Jan, “Spatially selective sound capture for speech 
and audio processing,” Speech Communication, vol. 13, pp. 207-222, Jan. 1993. 

[12] D. B. Ward and G. W. Elko, “Mixed nearfield/farfield beamforming: a new technique for 
speech acquisition in a reverberant environment,” in Proc. IEEE ASSP Workshop Appls. 
Signal Processing Audio Acoustics, 1997. 

[13] W. R. Hahn and S. A. Tretter, “Optimum processing for delay-vector estimation in passive
signal arrays,” IEEE Trans. Inform. Theory, vol. IT-19, pp. 608-614, May 1973.

[14] M. Wax and T. Kailath, “Optimum localization of multiple sources by passive arrays,” 
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, no. 5, pp. 1210-1218, 
Oct. 1983. 

[15] P. E. Stoica and A. Nehorai, “MUSIC, maximum likelihood and Cramer-Rao bound,” 
IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 720-740, May 1989. 







[16] J. C. Chen, R. E. Hudson, and K. Yao, “Maximum-likelihood source localization and
unknown sensor location estimation for wideband signals in the near-field,” IEEE Trans.
Signal Processing, vol. 50, pp. 1843-1854, Aug. 2002.

[17] Y. Bard, Nonlinear Parameter Estimation, New York: Academic Press, 1974. 

[18] W. H. Foy, “Position-location solutions by Taylor-series estimation,” IEEE Trans. Aerosp. 
Electron. Syst., vol. AES-12, pp. 187-194, Mar. 1976. 

[19] R. O. Schmidt, “A new approach to geometry of range difference location,” IEEE Trans. 
Aerosp. Electron. Syst., vol. AES-8, pp. 821-835, Nov. 1972. 

[20] H. C. Schau and A. Z. Robinson, “Passive source localization employing intersecting 
spherical surfaces from time-of-arrival differences,” IEEE Trans. Acoust., Speech, Signal 
Processing, vol. ASSP-35, no. 8, pp. 1223-1225, Aug. 1987. 

[21] J. O. Smith and J. S. Abel, “Closed-form least-squares source location estimation from 
range-difference measurements,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 
ASSP-35, no. 12, pp. 1661-1669, Dec. 1987. 

[22] Y. T. Chan and K. C. Ho, “A simple and efficient estimator for hyperbolic location,” IEEE 
Trans. Signal Processing, vol. 42, no. 8, pp. 1905-1915, Aug. 1994. 

[23] Y. T. Chan and K. C. Ho, “An efficient closed-form localization solution from time dif- 
ference of arrival measurements,” in Proc. IEEE ICASSP, 1994, vol. II, pp. 393-396. 

[24] Y. Huang, J. Benesty, G. W. Elko, and R. M. Mersereau, “Real-time passive source local- 
ization: an unbiased linear-correction least-squares approach,” IEEE Trans. Speech Audio 
Processing, vol. 9, no. 8, pp. 943-956, Nov. 2001. 

[25] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory, Englewood 
Cliffs, New Jersey: Prentice-Hall, 1993. 

[26] J. S. Abel and J. O. Smith, “The spherical interpolation method for closed-form passive 
source localization using range difference measurements,” in Proc. IEEE 
ICASSP, 1987, vol. 1, pp. 471-474. 

[27] Y. Huang, J. Benesty, and G. W. Elko, “Passive acoustic source localization for video 
camera steering,” in Proc. IEEE ICASSP, 2000, vol. 2, pp. 909-912. 

[28] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms, Upper Saddle 
River, NJ: Prentice-Hall, 1999. 

[29] C. D. Meyer, Matrix Analysis and Applied Linear Algebra, Philadelphia, PA: SIAM, 2000. 

[30] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in 
C: The Art of Scientific Computing, Cambridge: Cambridge University Press, 1988. 




Chapter 10 



BLIND SOURCE SEPARATION FOR 
CONVOLUTIVE MIXTURES: 

A UNIFIED TREATMENT 



Herbert Buchner 

University of Erlangen-Nuremberg 

buchner@LNT.de 



Robert Aichner 

University of Erlangen-Nuremberg 

aichner@LNT.de 



Walter Kellermann 

University of Erlangen-Nuremberg 

wk@LNT.de 



Abstract Blind source separation (BSS) algorithms for time series can exploit three prop- 
erties of the source signals: nonwhiteness, nonstationarity, and nongaussianity. 
While methods utilizing the first two properties are usually based on second-order 
statistics (SOS), higher-order statistics (HOS) must be considered to exploit non- 
gaussianity. In this chapter, we consider all three properties simultaneously to 
design BSS algorithms for convolutive mixtures within a new generic frame- 
work. This concept derives its generality from an appropriate matrix notation 
combined with the use of multivariate probability densities for considering the 
time-dependencies of the source signals. Based on a generalized cost function we 
rigorously derive the corresponding time-domain and frequency-domain broad- 
band algorithms. Due to the broadband approach, time-domain constraints are 
obtained which provide a more detailed understanding of the internal permutation 
problem in traditional narrowband frequency-domain BSS. For both, the time- 
domain and the frequency-domain versions, we discuss links to well-known and 
also to novel algorithms that follow as special cases of the framework. More- 
over, we use models for correlated spherically invariant random processes (SIRPs) 
which are well suited for a variety of source signals including speech to obtain 
efficient solutions in the HOS case. The concept provides a basis for off-line, 
on-line, and block-on-line algorithms by introducing a general weighting function, 
thereby allowing for tracking of time-varying real acoustic environments. 

Keywords: Blind Source Separation, Convolutive Mixtures, Second-Order Statistics, 
Higher-Order Statistics, Time Domain, Frequency Domain, Broadband Approach, 
Spherically Invariant Random Processes 



1. INTRODUCTION 

The problem of separating convolutive mixtures of unknown time series 
arises in several application domains, a prominent example being the so-called 
cocktail party problem, where we want to recover the speech signals of multiple 
speakers who are simultaneously talking in a room. The room will generally be 
reverberant due to reflections on the walls, i.e., the original source signals 
$s_q(n)$, $q = 1, \ldots, Q$ of our separation problem are filtered by a linear 
multiple-input and multiple-output (MIMO) system before they are picked up by 
the sensors. Most commonly used BSS algorithms are developed under the 
assumption that the number $Q$ of source signals $s_q(n)$ equals the number $P$ 
of sensor signals $x_p(n)$. However, the more general scenario with an arbitrary 
number of sources and sensors can always be reduced to the standard BSS model 
(Fig. 10.1). The case that the sensors outnumber the sources is termed 
overdetermined BSS ($P > Q$). The main approach to simplify the separation 
problem in this case is to apply principal component analysis (PCA) [1], extract 
the first $Q$ components, and then use standard BSS algorithms. The more 
difficult case $P < Q$ is called underdetermined BSS or BSS with overcomplete 
bases. Mostly the sparseness of the sources in the time-frequency domain is used 
to determine clusters which correspond to the separated sources (e.g., [2]). 
Recent developments showed that the sparseness can be exploited to eliminate 
$Q - P$ sources, and then again standard BSS algorithms can be applied [3]. 

Throughout this chapter, we therefore regard the standard BSS model where 
the number $Q$ of source signals $s_q(n)$ equals the number of sensor signals 
$x_p(n)$, $p = 1, \ldots, P$ (Fig. 10.1). An $M$-tap mixing system is thus 
described by 

\[
x_p(n) = \sum_{q=1}^{P} \sum_{\kappa=0}^{M-1} h_{qp}(\kappa)\, s_q(n-\kappa),
\tag{10.1}
\]

where $h_{qp}(\kappa)$, $\kappa = 0, \ldots, M-1$ denote the coefficients of the 
finite impulse response (FIR) filter model from the $q$-th source to the $p$-th 
sensor. 
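
As a rough illustration of the mixing model (10.1), the following NumPy sketch generates sensor signals from synthetic sources. The function name `convolutive_mix`, the array layout, and the random test data are assumptions made for this example only and are not taken from the chapter.

```python
import numpy as np

def convolutive_mix(s, h):
    """Mix Q sources into P sensor signals following (10.1).

    s: array of shape (Q, T) holding the source signals s_q(n)
    h: array of shape (Q, P, M) holding the FIR coefficients h_qp(kappa)
    Returns x of shape (P, T) with
    x_p(n) = sum_q sum_kappa h_qp(kappa) * s_q(n - kappa).
    """
    Q, T = s.shape
    _, P, M = h.shape
    x = np.zeros((P, T))
    for p in range(P):
        for q in range(Q):
            # full convolution, truncated to T samples (zero initial state)
            x[p] += np.convolve(s[q], h[q, p])[:T]
    return x

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))        # two synthetic, speech-like sources
h = rng.standard_normal((2, 2, 8))     # hypothetical 8-tap mixing filters
x = convolutive_mix(s, h)
```

Each sensor signal is a sum of filtered sources, so even with only two sources and short filters the mixtures no longer resemble either source sample by sample.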

In BSS, we are interested in finding a corresponding demixing system according 
to Fig. 10.1, where the output signals $y_q(n)$, $q = 1, \ldots, P$ are described 
by 

\[
y_q(n) = \sum_{p=1}^{P} \sum_{\kappa=0}^{L-1} w_{pq}(\kappa)\, x_p(n-\kappa).
\tag{10.2}
\]

Figure 10.1 Linear MIMO model for BSS (mixing system H, demixing system W). 


The separation of the mixtures obtained by the sensor signals $x_p(n)$ utilizes 
the fundamental assumption of statistical independence between the original 
source signals $s_q(n)$. It can be shown (see, e.g., [1]) that the MIMO demixing 
system coefficients $w_{pq}(\kappa)$ can in fact reconstruct the sources up to an 
unknown permutation of their order and an unknown filtering of the individual 
signals, where the demixing filter length $L$ should be chosen at least equal to 
$M$. It should be stressed that the filtering ambiguity prevents a deconvolution 
of the sensor signals and therefore BSS achieves a mere separation of 
statistically independent signals. 

From the description of the BSS model (see Fig. 10.1) it can be seen that 
this technique is closely related to adaptive beamforming. This relationship 
was first shown in [4] where BSS was also termed blind beamforming. Thus, 
as an inherent advantage of BSS, prior knowledge of the spatial position of 
the sensors and sources is not necessary and, therefore, BSS is robust against 
unknown array deformations or distortions of the wavefront. Another important 
difference is the optimization criterion in BSS which utilizes the statistical 
independence of the source signals. Thus, adaptation of the demixing system 
is possible even if all source signals are simultaneously active in contrast to 
adaptive beamforming where the distinction between target signal activity and 
interfering signal activity has to be made [5]. However, one drawback of most 
BSS algorithms is that currently the number of sources has to be known for 
estimating the demixing system. 
In [6] it was shown that merely decorrelating the output signals $y_q(n)$ does 
not lead to a separation of the sources. This implies that we have to force the 
output signals to become statistically decoupled up to joint moments of a certain 
order by using additional conditions. This can be realized by using approaches 
to blindly estimate the $P^2 L$ MIMO coefficients $w_{pq}(\kappa)$ in (10.2) by 
exploiting one of the following source signal properties [1]: 

(i) Nonwhiteness. Exploited by simultaneous diagonalization of output 
correlation matrices over multiple time-lags, e.g., [7, 8]. 

(ii) Nonstationarity. Exploited by simultaneous diagonalization of short-time 
output correlation matrices at different time instants, e.g., [6], [9]-[17]. 

(iii) Nongaussianity. Exploited by using higher-order statistics for 
independent component analysis (ICA), e.g., [18]-[23]. 

While there are several algorithms for convolutive mixtures - both in the time 
domain and in the frequency domain - utilizing one of these properties, few 
algorithms explicitly exploit two properties [24, 25] and so far, none is known 
which simultaneously exploits all three properties. However, it has recently 
been shown that in practical scenarios, the combination of these criteria can 
lead to improved performance [24, 25]. 

Extending the work in [26, 27], we present in the following a rigorous 
derivation of a unified framework for convolutive mixtures exploiting all three 
signal properties by using HOS. This is made possible by introducing an 
appropriate matrix notation combined with the use of multivariate probability 
densities for considering the time-dependencies of the source signals. The 
approach is suitable for on-line and off-line algorithms as it uses a general 
weighting function, thereby allowing for tracking of time-varying environments 
[28]. The processing delay can be kept low by working with overlapping and/or 
partitioned signal blocks [29]. Having derived a generic time-domain algorithm, 
we introduce a model for spherically invariant random processes (SIRPs) [30] 
which are well suited, e.g., for speech, to allow efficient realizations. 
Moreover, we discuss links to well-known SOS algorithms and we show that a 
previously presented algorithm [26] is the optimum second-order BSS approach in 
the sense of minimum mutual information known from information theory. 
Furthermore, we introduce an equivalent broadband formulation in the frequency 
domain by extending the tools of [31] to unsupervised adaptive filtering. This 
will also give a detailed insight into the internal permutation problem of 
narrowband frequency-domain BSS. Again, links to well-known and extended HOS and 
SOS algorithms as special cases are discussed. Moreover, using the so-called 
generalized coherence [32], links between the time-domain and frequency-domain 
SOS algorithms can be established [26] showing that our cost function leads to 
an update equation with an inherent normalization. As shown by experimental 
results, this allows an efficient separation of real-world speech signals. 

2. GENERIC BLOCK TIME-DOMAIN BSS 
ALGORITHM 

In this section, we first introduce a general matrix formulation as a basis for 
a rigorous derivation of time-domain algorithms from a cost function which 
inherently takes into account all three fundamental signal properties (i)-(iii). 
We then consider the so-called equivariance property in the convolutive case 
for deriving the corresponding natural gradient update. From this formulation, 
several well-known and novel algorithms follow as special cases. 

2.1 MATRIX NOTATION FOR CONVOLUTIVE 
MIXTURES 

From Fig. 10.1, it can be seen that the output signals $y_q(n)$ are obtained by 
convolving the input signals $x_p(n)$ with the demixing filter coefficients 
$w_{pq,\kappa}$. In addition to the filter length $L$ and the number of channels 
$P$ we need to introduce two more parameters for the following general 
formulation: 

■ the number of time-lags $D$ taken into account for exploiting the 
nonwhiteness property of the input signals as shown below ($1 \le D \le L$), 
and 

■ the block length $N$ as basis for averaging the estimates of the multivariate 
probability density functions (pdfs) as used below ($N > PD$ in general; 
$N > D$ for the natural gradient update discussed below). 

To derive an algorithm for block processing of convolutive mixtures taking into 
account $D$ time-lags, we first need to reformulate the convolution (10.2): 

\[
\mathbf{y}_q(m,j) = \sum_{p=1}^{P} \mathbf{x}_p(m,j)\, \mathbf{W}_{pq},
\tag{10.3}
\]

where $m$ denotes the block index, $j = 0, \ldots, N-1$ is a time-shift index 
within a block of length $N$, and 

\[
\mathbf{x}_p(m,j) = [x_p(mL+j), \ldots, x_p(mL-2L+1+j)], \tag{10.4}
\]
\[
\mathbf{y}_q(m,j) = [y_q(mL+j), \ldots, y_q(mL-D+1+j)]. \tag{10.5}
\]

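The $2L \times D$ matrix $\mathbf{W}_{pq}$ exhibits a Sylvester structure that 
contains all $L$ coefficients of the respective demixing filter in each column 
needed for the matrix formulation of the linear convolution: 

\[
\mathbf{W}_{pq} =
\begin{bmatrix}
w_{pq,0}   & 0          & \cdots & 0 \\
w_{pq,1}   & w_{pq,0}   & \ddots & \vdots \\
\vdots     & w_{pq,1}   & \ddots & 0 \\
w_{pq,L-1} & \vdots     & \ddots & w_{pq,0} \\
0          & w_{pq,L-1} & \ddots & w_{pq,1} \\
\vdots     &            & \ddots & \vdots \\
0          & \cdots     & 0      & w_{pq,L-1} \\
0          & \cdots     & 0      & 0 \\
\vdots     &            &        & \vdots \\
0          & \cdots     & 0      & 0
\end{bmatrix}.
\tag{10.6}
\]

It can be seen that for the general case, $1 \le D \le L$, the last $L-D+1$ rows 
are padded with zeros to ensure compatibility with the length of 
$\mathbf{x}_p(m,j)$ with regard to a concise frequency-domain formulation in 
Sect. 3. Finally, to allow a convenient notation of the algorithm we combine all 
channels, and thus we can write (10.3) compactly as 

\[
\mathbf{y}(m,j) = \mathbf{x}(m,j)\, \mathbf{W}, \tag{10.7}
\]

with 

\[
\mathbf{x}(m,j) = [\mathbf{x}_1(m,j), \ldots, \mathbf{x}_P(m,j)], \tag{10.8}
\]
\[
\mathbf{y}(m,j) = [\mathbf{y}_1(m,j), \ldots, \mathbf{y}_P(m,j)], \tag{10.9}
\]
\[
\mathbf{W} =
\begin{bmatrix}
\mathbf{W}_{11} & \cdots & \mathbf{W}_{1P} \\
\vdots          & \ddots & \vdots \\
\mathbf{W}_{P1} & \cdots & \mathbf{W}_{PP}
\end{bmatrix}.
\tag{10.10}
\]

Also, with respect to the frequency-domain derivation in Sect. 3, we extend 
(10.7) by collecting all $N$ vectors $\mathbf{x}_p$, $\mathbf{y}_q$, so that all 
output signal samples of the $m$-th block are captured: 

\[
\mathbf{Y}(m) = \mathbf{X}(m)\, \mathbf{W}, \tag{10.11}
\]

with the matrices 

\[
\mathbf{Y}(m) = [\mathbf{Y}_1(m), \ldots, \mathbf{Y}_P(m)], \tag{10.12}
\]
\[
\mathbf{X}(m) = [\mathbf{X}_1(m), \ldots, \mathbf{X}_P(m)], \tag{10.13}
\]
\[
\mathbf{Y}_q(m) = [\mathbf{y}_q^T(m,0), \ldots, \mathbf{y}_q^T(m,N-1)]^T, \tag{10.14}
\]
\[
\mathbf{X}_p(m) = [\mathbf{x}_p^T(m,0), \ldots, \mathbf{x}_p^T(m,N-1)]^T. \tag{10.15}
\]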
Superscript $T$ denotes the transposition of a vector or a matrix. Obviously, 
$\mathbf{X}_p(m)$, $p = 1, \ldots, P$ in (10.15) are Toeplitz matrices of size 
$(N \times 2L)$ due to the shift of subsequent rows by one sample each: 

\[
\mathbf{X}_p(m) =
\begin{bmatrix}
x_p(mL)     & \cdots & x_p(mL-2L+1) \\
x_p(mL+1)   & \ddots & x_p(mL-2L+2) \\
\vdots      & \ddots & \vdots \\
x_p(mL+N-1) & \cdots & x_p(mL-2L+N)
\end{bmatrix}.
\tag{10.16}
\]

Analogously to supervised block-based adaptive filtering [29, 31], the 
approach followed here can also be carried out with overlapping and/or 
partitioned data blocks to increase the convergence rate and to reduce the signal 
delay. Overlapping is done by simply replacing the time index $mL$ in the 
equations by $mL/\alpha$ with the overlap factor $1 \le \alpha \le L$. For 
clarity, we will omit the overlap factor and will point to it when necessary. 
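
The matrix formulation of this section can be checked numerically for a single channel pair. The sketch below builds the Sylvester matrix of (10.6) and verifies that the product in (10.3) reproduces the $D$ most recent output samples of an ordinary linear convolution, as stated in (10.5). The helper names `sylvester` and `x_row` and the random test signals are assumptions made for illustration.

```python
import numpy as np

def sylvester(w, D):
    """Return the 2L x D Sylvester matrix of (10.6) for an L-tap filter w."""
    L = len(w)
    W = np.zeros((2 * L, D))
    for d in range(D):
        W[d:d + L, d] = w          # column d holds the filter shifted down by d rows
    return W

def x_row(x, m, j, L):
    """Row vector x_p(m, j) of (10.4): samples mL+j down to mL-2L+1+j."""
    n0 = m * L + j
    return x[n0 - 2 * L + 1:n0 + 1][::-1]

rng = np.random.default_rng(1)
L, D = 4, 3
w = rng.standard_normal(L)
x = rng.standard_normal(64)
m, j = 5, 2

y_matrix = x_row(x, m, j, L) @ sylvester(w, D)   # (10.3) for one channel pair
y_conv = np.convolve(x, w)                       # y(n) = sum_k w(k) x(n - k)
n0 = m * L + j
y_direct = y_conv[n0 - D + 1:n0 + 1][::-1]       # [y(n0), ..., y(n0 - D + 1)]
```

The zero rows at the bottom of the Sylvester matrix simply ignore the oldest input samples in the length-$2L$ row vector, which is why the product has only $D$ entries.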

2.2 COST FUNCTION AND ALGORITHM 
DERIVATION 

A generic SOS algorithm for convolutive mixtures has been derived in [26] 
from a cost function that explicitly contains correlation matrices that include 
several time-lags (cf. property (i)) under the assumption of short-time 
stationarity (cf. property (ii)). Additionally, for exploiting property (iii), 
higher-order statistics have to be considered. Higher-order approaches for BSS 
can be divided into three classes [1]: maximum likelihood (ML) estimation [21], 
minimization of the mutual information (MMI) among the output signals [33], 
and maximization of the entropy (ME/‘infomax’) [23]. Although all of these 
HOS approaches lead to similar update rules, MMI can be regarded as the most 
general one [33]. 

Based on a generalization of Shannon’s mutual information [34], we now 
define the following cost function which simultaneously accounts for the three 
fundamental properties (i)-(iii): 

\[
\mathcal{J}(m) = -\sum_{i=0}^{\infty} \beta(i,m)\, \frac{1}{N} \sum_{j=0}^{N-1}
\Big\{ \log\big( \hat{p}_{1,D}(\mathbf{y}_1(i,j)) \cdot \ldots \cdot
\hat{p}_{P,D}(\mathbf{y}_P(i,j)) \big)
- \log\big( \hat{p}_{PD}(\mathbf{y}_1(i,j), \ldots, \mathbf{y}_P(i,j)) \big) \Big\},
\tag{10.17}
\]

where $\hat{p}_{p,D}(\cdot)$ is the estimated or assumed multivariate probability 
density function (pdf) for channel $p$ of dimension $D$ and 
$\hat{p}_{PD}(\cdot)$ is the joint pdf of dimension $PD$ over all channels. 
Furthermore, $D$ is the memory length, i.e., the number of time-lags to model the 
nonwhiteness of the $P$ signals as above. Note also that the time series of these 
pdf estimates completely describes any multichannel stochastic process with the 
assumption of short-time stationarity over length-$N$ blocks (this assumption is 
reasonable for many real-world signals such as speech). The expectation operator 
of the mutual information [34] is replaced in (10.17) by short-time averages 
within these blocks. $\beta$ is a window function with finite support that is 
normalized according to $\sum_{i=0}^{m} \beta(i,m) = 1$ and allows off-line, 
on-line, and block-on-line implementations of the algorithms. As an example, 
$\beta(i,m) = (1-\lambda)\lambda^{m-i}$ for $0 \le i \le m$, and 
$\beta(i,m) = 0$ else, leads to an efficient on-line version allowing for 
tracking in time-varying environments [28]. 
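
To illustrate why the exponential window is convenient for on-line operation, the following sketch compares the explicit weighted sum over all past blocks with an equivalent first-order recursion that only stores the previous average. The scalar `block_values` stands in for any per-block statistic (for instance one entry of a correlation matrix); the names and data are assumptions for this example.

```python
import numpy as np

def beta(i, m, lam):
    """Exponential window: (1 - lam) * lam**(m - i) for 0 <= i <= m, else 0."""
    return (1.0 - lam) * lam ** (m - i) if 0 <= i <= m else 0.0

lam = 0.9
rng = np.random.default_rng(2)
block_values = rng.standard_normal(50)   # stand-in for a per-block statistic

# explicit weighted sum over all blocks up to the current block index m
m = len(block_values) - 1
explicit = sum(beta(i, m, lam) * block_values[i] for i in range(m + 1))

# equivalent recursion that needs only the previous weighted average
recursive = 0.0
for v in block_values:
    recursive = lam * recursive + (1.0 - lam) * v
```

Both quantities agree up to rounding, so the weighting in (10.17) can be tracked block by block without keeping the full history.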

In this chapter, we consider algorithms based on first-order gradients. An 
extension to higher-order gradients would be straightforward but computationally 
more expensive. Moreover, to obtain general expressions allowing a smooth 
transition to the frequency domain, we consider complex signals for the 
derivative. In order to calculate the gradient [35, 36] 

\[
\nabla_{\mathbf{W}} \mathcal{J}(m) = 2\, \frac{\partial \mathcal{J}(m)}{\partial \mathbf{W}^*},
\tag{10.18}
\]

we need to express the cost function (10.17) in terms of the demixing matrix 
$\mathbf{W}$ which contains the coefficients of all channels. A common way to 
achieve this is to transform the output signal pdf $\hat{p}_{PD}(\mathbf{y})$ 
into the $PD$-dimensional input signal pdf using $\mathbf{W}$ which is considered 
as a mapping matrix for this linear transformation. This procedure is directly 
applied to the second term in the braces of (10.17), followed by differentiation 
w.r.t. $\mathbf{W}$. The derivative of the input signal pdf, which appears as an 
additive constant due to the logarithm, vanishes as it is independent of 
$\mathbf{W}$. The argument of the logarithm in the first term in the braces, 
however, is factorized among the channels. Therefore, we apply the chain rule in 
this case, rather than transforming the pdfs. 

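Finally, the generic HOS gradient for the coefficient update utilizing all three 
signal properties (i)-(iii) can be expressed as 

\[
\nabla_{\mathbf{W}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)
\sum_{j=0}^{N-1} \Big\{ \mathbf{x}^H(i,j)\, \boldsymbol{\Phi}(\mathbf{y}(i,j))
- \mathbf{W}^{1_D 0}_{2LP \times DP}
\big( \mathbf{W}^H \mathbf{W}^{1_D 0}_{2LP \times DP} \big)^{-1} \Big\},
\tag{10.19}
\]

with the multivariate score function 

\[
\boldsymbol{\Phi}(\mathbf{y}(i,j)) = -\left[
\frac{\partial \hat{p}_{1,D}(\mathbf{y}_1(i,j)) / \partial \mathbf{y}_1(i,j)}
     {\hat{p}_{1,D}(\mathbf{y}_1(i,j))},
\; \ldots, \;
\frac{\partial \hat{p}_{P,D}(\mathbf{y}_P(i,j)) / \partial \mathbf{y}_P(i,j)}
     {\hat{p}_{P,D}(\mathbf{y}_P(i,j))}
\right]
\tag{10.20}
\]

and the $2LP \times DP$ window matrix $\mathbf{W}^{1_D 0}_{2LP \times DP}$ 
defined as 

\[
\mathbf{W}^{1_D 0}_{2LP \times DP} = \mathrm{bdiag}\big\{
\mathbf{W}^{1_D 0}_{2L \times D}, \ldots, \mathbf{W}^{1_D 0}_{2L \times D} \big\},
\tag{10.21}
\]
\[
\mathbf{W}^{1_D 0}_{2L \times D} =
\begin{bmatrix} \mathbf{I}_{D \times D} \\ \mathbf{0}_{(2L-D) \times D} \end{bmatrix}.
\tag{10.22}
\]

The operator $\mathrm{bdiag}\{\mathbf{A}_1, \ldots, \mathbf{A}_P\}$ denotes a 
block-diagonal matrix with submatrices $\mathbf{A}_1, \ldots, \mathbf{A}_P$ on 
its diagonal. For the description of window matrices (also appearing in the 
frequency-domain algorithms in Sect. 3) we use the following conventions: 

■ The lower index of a matrix denotes its dimensions. 

■ P-channel matrices (as indicated by the size in the lower index) are 
partitioned into P single-channel window matrices. 

■ The upper index describes the positions of ones and zeros. Unity 
submatrices are always located at the upper left (‘10’) or lower right (‘01’) 
corners of the respective single-channel window matrix. The size of these 
clusters is indicated in subscript (e.g., ‘$01_L$’). 

The window matrix $\mathbf{W}^{1_D 0}_{2LP \times DP}$ appears due to the 
transformation of pdfs by the non-square Sylvester matrix $\mathbf{W}$ [37]. 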

With an iterative optimization procedure, the current demixing matrix is 
obtained by the recursive update equation 

\[
\mathbf{W}(m) = \mathbf{W}(m-1) - \mu\, \Delta\mathbf{W}(m), \tag{10.23}
\]

where $\mu$ is a stepsize parameter, and $\Delta\mathbf{W}(m)$ is the update 
which is set equal to $\nabla_{\mathbf{W}} \mathcal{J}(m)$ for gradient descent 
adaptation. Due to the adaptation process, the coefficient matrix becomes 
time-variant. For clarity we will generally omit the block index of $\mathbf{W}$ 
and will point to it when necessary. Note that the Sylvester structure (see 
Eqs. 10.6, 10.10) of the update in (10.23) has to be ensured. (The structure of 
the update might be disturbed by imprecision effects and also depends on the 
technique used for estimating the pdfs.) A simple remedy for this generic update 
is to pick the first column and replicate it. For special cases, and 
frequency-domain versions discussed later, we will give more specific solutions 
for enforcing this constraint. 
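
One possible reading of the simple remedy above, for a single $2L \times D$ block of the update, is sketched here: the first $L$ entries of the first column are taken as the filter update and replicated along the Sylvester pattern of (10.6). The function name `enforce_sylvester` and the random input are assumptions for illustration; the chapter does not prescribe an implementation.

```python
import numpy as np

def enforce_sylvester(dW):
    """Rebuild a 2L x D Sylvester block from the first column of dW."""
    two_L, D = dW.shape
    L = two_L // 2
    coeffs = dW[:L, 0]                 # first L entries of the first column
    out = np.zeros_like(dW)
    for d in range(D):
        out[d:d + L, d] = coeffs       # replicate with a one-row shift per column
    return out

rng = np.random.default_rng(3)
L, D = 4, 3
noisy_update = rng.standard_normal((2 * L, D))   # update block with broken structure
clean_update = enforce_sylvester(noisy_update)
```

For the full $P$-channel matrix of (10.10), the same operation would be applied to each of the $P^2$ blocks separately.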

2.3 EQUIVARIANCE PROPERTY AND NATURAL 
GRADIENT 

It is known that stochastic gradient descent, i.e., 
$\Delta\mathbf{W}(m) = \nabla_{\mathbf{W}} \mathcal{J}(m)$, 
suffers from slow convergence in many practical problems due to statistical 
dependencies in the data being processed. 

In the BSS application, we can show that the separation performance of the 
gradient update rule (10.19), (10.23) depends on the MIMO mixing system. The 
mixing process can be described analogously to (10.7) by 
$\mathbf{x}(m,j) = \mathbf{s}(m,j)\mathbf{H}$, where $\mathbf{s}(m,j)$ is the 
corresponding $1 \times P(M+2L-1)$ source signal row vector and $\mathbf{H}$ is 
the $P(M+2L-1) \times 2PL$ mixing matrix in Sylvester structure. The dimensions 
result from the linearity condition of the convolution: each of the $2L$ sensor 
samples in $\mathbf{x}_p(m,j)$ depends on $M$ consecutive source samples. Due to 
the inevitable filtering ambiguity in convolutive BSS (e.g., [1]), it is at best 
possible to obtain an arbitrary block diagonal matrix 
$\mathbf{C} = \mathbf{H}\mathbf{W}$, i.e., 
$\mathbf{C} - \mathrm{bdiag}\,\mathbf{C} = \mathbf{0}$, where $\mathbf{C}$ 
combines mixing and unmixing coefficient matrices. This means the output signals 
can become mutually independent but the output signals are still arbitrarily 
filtered versions of the source signals. To see how (10.19) behaves, we 
pre-multiply both sides of (10.19) by $\mathbf{H}$. This way it can easily be 
shown that $\mathbf{C}(m)$ depends on the mixing system $\mathbf{H}$, and, 
therefore, on its conditioning. 

Fortunately, a modification of the ordinary gradient, termed the natural 
gradient by Amari [20] and the relative gradient by Cardoso [21] (which is 
equivalent to the natural gradient in the BSS application), has been developed 
that largely removes all effects of an ill-conditioned mixing matrix 
$\mathbf{H}$ assuming an appropriate initialization of $\mathbf{W}$. The idea of 
the relative gradient is based on the equivariance property. Generally speaking, 
an estimator behaves equivariantly if it produces estimates that, under data 
transformation, are similarly transformed. A key property of equivariant 
estimators is that they exhibit uniform performance. In [26] the 
natural/relative gradient has been extended to the case of Sylvester matrices 
yielding 

\[
\nabla^{NG}_{\mathbf{W}} \mathcal{J} = \mathbf{W}\mathbf{W}^H \nabla_{\mathbf{W}} \mathcal{J}.
\tag{10.24}
\]

Together with (10.19) this immediately leads to the following expression: 

\[
\nabla^{NG}_{\mathbf{W}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)
\sum_{j=0}^{N-1} \mathbf{W} \big\{ \mathbf{y}^H(i,j)\,
\boldsymbol{\Phi}(\mathbf{y}(i,j)) - \mathbf{I} \big\},
\tag{10.25}
\]

which is then used as update $\Delta\mathbf{W}$ in (10.23). 

In the derivation of the natural gradient for instantaneous mixtures, the fact 
that the demixing matrices form a so-called Lie group has played an important 
role [20]. However, the block-Sylvester matrices $\mathbf{W}$ after (10.6), 
(10.10) do not form a Lie group (as they are generally not invertible). To see 
that the above formulation of the natural gradient is indeed justified, we again 
pre-multiply the update (10.25) with $\mathbf{H}$, which leads to 

\[
\Delta\mathbf{C}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)
\sum_{j=0}^{N-1} \mathbf{C}(i) \big\{ \mathbf{y}^H(i,j)\,
\boldsymbol{\Phi}(\mathbf{y}(i,j)) - \mathbf{I} \big\}.
\]

Thus, the temporal evolution of $\mathbf{C} = \mathbf{C}(m)$ depends only on the 
estimated source signal vector sequence and the stepsize $\mu$, and the 
dependency on the mixing matrix $\mathbf{H}$ has been absorbed as an initial 
condition into $\mathbf{C}(0) = \mathbf{H}\mathbf{W}(0)$, leading to the desired 
uniform performance of (10.25) and proving the equivariance property of the 
natural gradient. 

Another well-known advantage of using the natural gradient is a reduction of 
the computational complexity of the update as the inversion of the 
$PD \times PD$ matrix $\mathbf{W}^H \mathbf{W}^{1_D 0}_{2LP \times DP}$ in 
(10.19) need not be carried out in (10.25). Furthermore, it can be shown for 
specific pdfs (Sect. 2.4) that instead of $N > PD$ the condition $N > D$ is 
sufficient for the natural gradient update due to the smaller matrices to be 
inverted [26]. 

Moreover, noting that the products of Sylvester matrices $\mathbf{W}_{pq}$ and 
the remaining matrices in the update equation (10.25) can be described by linear 
convolutions, they can be efficiently implemented by a fast convolution. 

The update in (10.25) represents a so-called holonomic algorithm as it 
imposes the constraint 
$\mathbf{y}_p^H(i,j)\, \boldsymbol{\Phi}_p(\mathbf{y}_p(i,j)) = \mathbf{I}$ on 
the magnitudes of the recovered signals. However, when the source signals are 
nonstationary, these constraints may force a rapid change in the magnitude of 
the demixing matrix leading to numerical instabilities in some cases (see, e.g., 
[19]). Replacing $\mathbf{I}$ in (10.25) by the term 
$\mathrm{bdiag}\{\mathbf{y}^H(i,j)\, \boldsymbol{\Phi}(\mathbf{y}(i,j))\}$ yields 
the nonholonomic natural gradient algorithm with improved convergence 
characteristics for nonstationary sources: 

\[
\nabla^{NG}_{\mathbf{W}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)
\sum_{j=0}^{N-1} \mathbf{W} \big\{ \mathbf{y}^H(i,j)\,
\boldsymbol{\Phi}(\mathbf{y}(i,j))
- \mathrm{bdiag}\{ \mathbf{y}^H(i,j)\, \boldsymbol{\Phi}(\mathbf{y}(i,j)) \} \big\}.
\tag{10.26}
\]

Here, the bdiag operator sets all channel-wise cross-terms to zero. Note that 
the nonholonomic property can also be directly taken into account in the cost 
function as shown in [27]. 

2.4 SPECIAL CASES AND LINKS TO KNOWN 
TIME-DOMAIN ALGORITHMS 

The update rules (10.19) and (10.25) provide a very general basis for BSS of 
convolutive mixtures. However, to apply them in a real-world scenario, an 
appropriate multivariate score function (10.20) has to be determined, i.e., we 
have to handle $P$ high-dimensional multivariate pdfs 
$\hat{p}_{p,D}(\mathbf{y}_p(i,j))$, $p = 1, \ldots, P$. In general, this is a 
very challenging task, as it includes all corresponding higher-order cumulants 
(including time-lags which may be on the order of several hundred in real 
acoustic environments). 

In the following we will present an efficient solution for these problems by 
assuming so-called spherically invariant random processes (SIRPs). Moreover, 
we will show some links to SOS algorithms. Without loss of generality we 
consider now the case $P = Q = 2$ for simplicity. 
2.4.1 Incorporating Spherically Invariant Random Processes (SIRPs) 
as Signal Model. The SIRP models are representative for a wide class of 
stochastic processes. It has been shown that speech signals in particular can 
very accurately be represented by SIRPs [30]. One of the great advantages 
arising from the SIRP model is that multivariate pdfs can be derived analytically 
from the corresponding univariate probability density function together with 
the correlation matrices including time-lags. The correlation matrices can be 
estimated from the data while for the univariate pdf, we can assume one of 
the well-known functions for speech signals, e.g., the Laplacian density, or we 
can estimate the univariate pdf as well, based on parameterized representations, 
such as the Gram-Charlier or Edgeworth expansions [18]. 

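The general model of a correlated SIRP of $D$-th order for channel $p$ is given 
with a properly chosen function $f_{p,D}(\cdot)$ by [30] 

\[
\hat{p}_{p,D}(\mathbf{y}_p(i,j)) =
\frac{1}{\sqrt{\pi^D \det(\mathbf{R}_{\mathbf{y}_p\mathbf{y}_p}(i))}}\,
f_{p,D}\big( \mathbf{y}_p(i,j)\, \mathbf{R}^{-1}_{\mathbf{y}_p\mathbf{y}_p}(i)\,
\mathbf{y}_p^H(i,j) \big)
\tag{10.27}
\]

with the $D \times D$ correlation matrix $\mathbf{R}_{\mathbf{y}_p\mathbf{y}_q}$ 
defined as 

\[
\mathbf{R}_{\mathbf{y}_p\mathbf{y}_q}(i) = \frac{1}{N} \sum_{j=0}^{N-1}
\mathbf{y}_p^H(i,j)\, \mathbf{y}_q(i,j)
= \frac{1}{N}\, \mathbf{Y}_p^H(i)\, \mathbf{Y}_q(i).
\tag{10.28}
\]

As the best known example, the multivariate Gaussian can be viewed as a 
special case of the class of SIRPs. To calculate the score function for SIRPs in 
general, we apply the chain rule [36] to Eq. 10.27: 

\[
-\frac{\partial \hat{p}_{p,D}(\mathbf{y}_p(i,j)) / \partial \mathbf{y}_p(i,j)}
      {\hat{p}_{p,D}(\mathbf{y}_p(i,j))}
= \underbrace{\left[ -\frac{1}{f_{p,D}(u_p)}
  \frac{\partial f_{p,D}(u_p)}{\partial u_p} \right]}_{:=\, \phi_{p,D}(u_p)}
\mathbf{y}_p(i,j)\, \mathbf{R}^{-1}_{\mathbf{y}_p\mathbf{y}_p}(i),
\tag{10.29}
\]

where $u_p = \mathbf{y}_p \mathbf{R}^{-1}_{\mathbf{y}_p\mathbf{y}_p} \mathbf{y}_p^H$. 
For convenience, we call the scalar function $\phi_{p,D}(u_p)$ the SIRP score of 
channel $p$. 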

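Having derived the multivariate score function for the SIRP model (10.29), 
we can now introduce it into the generic HOS natural gradient update equation 
(10.25) with its nonholonomic extension. In the 2-by-2 case, this leads to the 
following expression for the nonholonomic HOS-SIRP update: 

\[
\Delta\mathbf{W}(m) = 2 \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{W}
\begin{bmatrix}
\mathbf{0} &
\tilde{\mathbf{R}}_{\mathbf{y}_1\mathbf{y}_2}(i)\,
\mathbf{R}^{-1}_{\mathbf{y}_2\mathbf{y}_2}(i) \\
\tilde{\mathbf{R}}_{\mathbf{y}_2\mathbf{y}_1}(i)\,
\mathbf{R}^{-1}_{\mathbf{y}_1\mathbf{y}_1}(i) &
\mathbf{0}
\end{bmatrix},
\tag{10.30}
\]

where the modified matrices $\tilde{\mathbf{R}}_{\mathbf{y}_p\mathbf{y}_q}$, 
$p \ne q$ are given by 

\[
\tilde{\mathbf{R}}_{\mathbf{y}_p\mathbf{y}_q}(i) = \frac{1}{N} \sum_{j=0}^{N-1}
\phi_{q,D}\big( \mathbf{y}_q(i,j)\, \mathbf{R}^{-1}_{\mathbf{y}_q\mathbf{y}_q}(i)\,
\mathbf{y}_q^H(i,j) \big)\, \mathbf{y}_p^H(i,j)\, \mathbf{y}_q(i,j),
\tag{10.31}
\]
\[
\phi_{q,D}(u_q) = -\frac{1}{f_{q,D}(u_q)}\, \frac{\partial f_{q,D}(u_q)}{\partial u_q}.
\tag{10.32}
\]

The SIRP score $\phi_{q,D}(u_q)$ of channel $q$ in (10.31) is a scalar-valued 
function which causes a weighting of the correlation matrix. 

From the update equation (10.30), we see that the SIRP model leads to an 
inherent normalization by the auto-correlation submatrices. 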

To derive a HOS-SIRP realization using (10.32) we need an analytical 
expression of the multivariate pdfs (10.27) for all channels. As noted above, 
for SIRPs, these expressions can actually be derived from the univariate pdfs 
[30]. Following the procedure in [30], we obtain, e.g., as the optimum SIRP 
score for univariate Laplacian pdfs [27]: 

\[
\phi_{q,D}(u_q) = \frac{1}{\sqrt{2u_q}}\,
\frac{K_{D/2+1}\big(\sqrt{2u_q}\big)}{K_{D/2}\big(\sqrt{2u_q}\big)},
\tag{10.33}
\]

where $K_\nu(\cdot)$ denotes the $\nu$-th order modified Bessel function of the 
second kind. 



2.4.2 Generic BSS Based on Second-Order Statistics. To see the link to 
second-order BSS algorithms we use the model of multivariate Gaussian pdfs 
in the general cost function (10.17). As for Gaussian pdfs the cost function 
reduces to SOS, we only utilize the nonstationarity and the nonwhiteness of the 
source signals. We now insert the multivariate Gaussian pdf 

\[
\hat{p}_{p,D}(\mathbf{y}_p(i,j)) =
\frac{1}{\sqrt{(2\pi)^D \det(\mathbf{R}_{\mathbf{y}_p\mathbf{y}_p}(i))}}\,
\exp\Big( -\tfrac{1}{2}\, \mathbf{y}_p(i,j)\,
\mathbf{R}^{-1}_{\mathbf{y}_p\mathbf{y}_p}(i)\, \mathbf{y}_p^H(i,j) \Big)
\tag{10.34}
\]

in the natural gradient update equation of the generic HOS BSS algorithm 
(10.25). Note that there are several different representations of real and 
complex Gaussian multivariate pdfs in the literature [37, 38]. The most 
important ones in practice are the real case for speech and audio applications, 
and the rotation-invariant complex case mostly used in communication theory. In 
both cases the elements of the score function 
$\boldsymbol{\Phi}(\mathbf{y}(i,j))$ for a Gaussian pdf reduce to 

\[
-\frac{\partial \hat{p}_{p,D}(\mathbf{y}_p(i,j)) / \partial \mathbf{y}_p(i,j)}
      {\hat{p}_{p,D}(\mathbf{y}_p(i,j))}
= \mathbf{y}_p(i,j)\, \mathbf{R}^{-1}_{\mathbf{y}_p\mathbf{y}_p}(i).
\tag{10.35}
\]
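
The Gaussian score in (10.35) can be verified numerically for the real case. The sketch below compares the closed form $\mathbf{y}_p \mathbf{R}^{-1}$ with a central finite-difference approximation of the negative gradient of the log-density. The matrix `R`, the vector `y`, and the helper `gaussian_logpdf` are synthetic stand-ins introduced for this check only.

```python
import numpy as np

def gaussian_logpdf(y, R):
    """Log of the real D-variate zero-mean Gaussian density with correlation matrix R."""
    D = len(y)
    _, logdet = np.linalg.slogdet(R)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(R, y))

rng = np.random.default_rng(4)
D = 4
A = rng.standard_normal((D, D))
R = A @ A.T + D * np.eye(D)            # symmetric positive definite stand-in
y = rng.standard_normal(D)

score_closed_form = y @ np.linalg.inv(R)     # right-hand side of (10.35)

eps = 1e-6
score_numeric = np.array([
    -(gaussian_logpdf(y + eps * e, R) - gaussian_logpdf(y - eps * e, R)) / (2 * eps)
    for e in np.eye(D)                       # perturb one coordinate at a time
])
```

Since the score is linear in $\mathbf{y}_p$ for a Gaussian pdf, only second-order statistics enter the resulting update.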
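With (10.25) and (10.35) we finally obtain the natural gradient update of the 
generic SOS BSS algorithm originally introduced in [26] 

\[
\nabla^{NG}_{\mathbf{W}} \mathcal{J}(m) = 2 \sum_{i=0}^{\infty} \beta(i,m)\,
\mathbf{W} \big\{ \mathbf{R}_{\mathbf{yy}}(i)
- \mathrm{bdiag}\, \mathbf{R}_{\mathbf{yy}}(i) \big\}\,
\mathrm{bdiag}^{-1}\, \mathbf{R}_{\mathbf{yy}}(i)
\tag{10.36}
\]

with the $PD \times PD$ short-time correlation matrix 
$\mathbf{R}_{\mathbf{yy}}(i)$ defined as 

\[
\mathbf{R}_{\mathbf{yy}}(i) = \frac{1}{N} \sum_{j=0}^{N-1}
\mathbf{y}^H(i,j)\, \mathbf{y}(i,j).
\tag{10.37}
\]

For the $2 \times 2$ case we can express (10.36) as 

\[
\nabla^{NG}_{\mathbf{W}} \mathcal{J}(m) = 2 \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{W}
\begin{bmatrix}
\mathbf{0} &
\mathbf{R}_{\mathbf{y}_1\mathbf{y}_2}(i)\,
\mathbf{R}^{-1}_{\mathbf{y}_2\mathbf{y}_2}(i) \\
\mathbf{R}_{\mathbf{y}_2\mathbf{y}_1}(i)\,
\mathbf{R}^{-1}_{\mathbf{y}_1\mathbf{y}_1}(i) &
\mathbf{0}
\end{bmatrix}.
\tag{10.38}
\]
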

This generic SOS algorithm leads to very robust practical solutions even for 
a large number of filter taps (see below) due to an inherent normalization by 
the auto-correlation matrices $\mathbf{R}_{\mathbf{y}_p\mathbf{y}_p}$ as known 
from the recursive least-squares (RLS) algorithm in supervised adaptive 
filtering [35]. Again, it is important to note that the products of Sylvester 
matrices $\mathbf{W}_{pq}$ and the remaining matrices in the update equation 
(10.38) can be described by linear convolutions. Thus they can be efficiently 
implemented by a fast convolution as in [25]. 

Moreover, by comparing (10.38) to the HOS-SIRP update (10.30), it can be 
seen that due to the fact that only SOS are utilized we obtain the same update 
with the nonlinearity omitted, i.e., $\phi_{q,D}(u_q) = 1$, $q = 1, \ldots, P$. 
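
A minimal single-block sketch of the quantities entering (10.36) and (10.37) is given below. It estimates the short-time correlation matrix from the lagged output vectors of (10.5) and forms the factor that multiplies $\mathbf{W}$ in the update. The weighting by $\beta(i,m)$, the sum over blocks, and the multiplication by $\mathbf{W}$ are deliberately left out, and all function names and signals are assumptions for illustration.

```python
import numpy as np

def lagged_row(y, n0, D):
    """Row vector [y(n0), ..., y(n0 - D + 1)] as in (10.5)."""
    return y[n0 - D + 1:n0 + 1][::-1]

def block_correlation(ys, m, L, N, D):
    """PD x PD short-time correlation matrix of (10.37) for block m (real signals)."""
    rows = np.array([
        np.concatenate([lagged_row(y, m * L + j, D) for y in ys])
        for j in range(N)
    ])                                  # shape (N, P*D), one row per time shift j
    return rows.T @ rows / N

def bdiag(R, D):
    """Keep the D x D blocks on the diagonal, zero all cross-channel blocks."""
    out = np.zeros_like(R)
    for k in range(0, R.shape[0], D):
        out[k:k + D, k:k + D] = R[k:k + D, k:k + D]
    return out

def sos_update_term(R, D):
    """{R_yy - bdiag R_yy} bdiag^{-1} R_yy, the factor multiplying W in (10.36)."""
    B = bdiag(R, D)
    return (R - B) @ np.linalg.inv(B)

rng = np.random.default_rng(5)
P, L, D, N = 2, 8, 4, 64
ys = [rng.standard_normal(512) for _ in range(P)]
ys_mixed = [ys[0] + 0.5 * ys[1], ys[1]]     # outputs that are not yet separated

R = block_correlation(ys_mixed, m=20, L=L, N=N, D=D)
term = sos_update_term(R, D)
```

The diagonal blocks of `term` vanish by construction, while the off-diagonal blocks are the normalized cross-correlations that appear in the $2 \times 2$ form (10.38).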

The original derivation [26] of the generic SOS natural gradient update 
(10.36) was based on a generalization of the cost function of [10]: 



$$\mathcal{J}(m) = \sum_{i=0}^{\infty} \beta(i,m) \left\{ \log \det \mathrm{bdiag}\, \mathbf{R}_{yy}(i) - \log \det \mathbf{R}_{yy}(i) \right\}. \qquad (10.39)$$



In Fig. 10.2 the mechanism of the SOS cost function (10.39) is illustrated.
By minimizing $\mathcal{J}(m)$, all cross-correlations for D time-lags are reduced and
will ideally vanish, while the auto-correlations are untouched. As both cost
functions (10.17) and (10.39) lead to the same result in the SOS case, we may 
now conclude that the algorithm in [26] is in fact the optimum SOS algorithm 
for convolutive mixtures in the sense of minimum mutual information or ML, 
which also implies asymptotic Fisher-efficiency [1, 39]. 
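
The behaviour of one block's contribution to (10.39) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the helper names `det`, `bdiag` and `sos_cost_term` and the example matrix are assumptions made for illustration. It evaluates log det bdiag R_yy - log det R_yy for P = 2 channels and D = 2 lags, and shows that the term is positive when cross-correlations are present and zero once they vanish.

```python
import math

def det(matrix):
    """Determinant of a small real square matrix (Gaussian elimination with pivoting)."""
    a = [row[:] for row in matrix]
    n = len(a)
    d = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if abs(a[pivot][k]) < 1e-15:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            d = -d
        d *= a[k][k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
    return d

def bdiag(matrix, block):
    """Keep only the (block x block) submatrices on the main diagonal."""
    n = len(matrix)
    return [[matrix[r][c] if r // block == c // block else 0.0
             for c in range(n)] for r in range(n)]

def sos_cost_term(r_yy, d_lags):
    """One block's contribution to (10.39): log det bdiag R_yy - log det R_yy."""
    return math.log(det(bdiag(r_yy, d_lags))) - math.log(det(r_yy))

# Hypothetical 4 x 4 correlation matrix (P = 2, D = 2), symmetric and
# diagonally dominant, hence positive definite.
r_mixed = [[2.0, 0.5, 0.6, 0.1],
           [0.5, 2.0, 0.2, 0.6],
           [0.6, 0.2, 1.5, 0.3],
           [0.1, 0.6, 0.3, 1.5]]
r_separated = bdiag(r_mixed, 2)  # all cross-correlations removed

cost_mixed = sos_cost_term(r_mixed, 2)
cost_separated = sos_cost_term(r_separated, 2)
```

The auto-correlation blocks are identical in both matrices, so the difference between the two cost values is caused by the cross-correlation blocks alone.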

Another interesting finding is that for both the holonomic and the nonholonomic
versions of the HOS update (10.25), (10.26), the SOS BSS algorithm obtained
by inserting the Gaussian pdf (10.34) turns out to be nonholonomic, confirming
its good performance for speech sources.
Figure 10.3 Overview of time-domain algorithms based on second-order statistics.



Note that in principle, there are two basic methods to estimate the output 
correlation matrices (10.37) for nonstationary output signals: the so-called 
correlation method, and the covariance method as they are known from linear 
prediction problems [40]. While the correlation method leads to a slightly 
lower computational complexity due to the Toeplitz structure of the matrices 
Ryy (and to smaller matrices, when implemented in the frequency domain 
covered in Sect. 3.), we consider the more accurate covariance method in this 
chapter. Note also that (10.37) is full rank since in general we assume N > PD. 

2.4.3 Approximations of the Generic BSS Based on Second-Order 
Statistics. The generic update (10.36) is now analyzed and links to known 
algorithms (see Fig. 10.3) are presented. We highlight here two realizations. 

For D = 1, the correlation matrices $\mathbf{R}_{y_p y_q}(i)$ become scalar values as only
a single lag is considered for the correlations. Thus the resulting algorithm
only takes the nonstationarity property into account. This was first proposed
by Kawamoto et al. in [11].

In [24, 25], a time-domain algorithm was presented that copes very well with
reverberant acoustic environments. Although it was originally introduced as a
heuristic extension of [11] incorporating several time-lags, this algorithm can be
directly obtained from (10.38) for D = L by approximating the auto-correlation
matrices $\mathbf{R}_{y_q y_q}(i)$ by the output signal powers, i.e.,

$$\tilde{\mathbf{R}}_{y_q y_q}(i) = \mathbf{y}_q^H(i)\, \mathbf{y}_q(i)\, \mathbf{I}_{D \times D} \qquad (10.40)$$

for $q = 1, \ldots, P$, where $\mathbf{y}_q(i)$ denotes the first column of $\mathbf{Y}_q(i)$. Thus, this
approximation is comparable to the well-known normalized least mean squares
(NLMS) algorithm in supervised adaptive filtering approximating the RLS
algorithm [35]. In addition to the reduced computational complexity, we can
ensure the Sylvester structure of the update by using the correlation method
[40] for calculation of the short-time correlation matrices $\mathbf{R}_{y_p y_q}(i)$, resulting
in Toeplitz matrices $\mathbf{R}_{y_p y_q}(i)$. The remaining products of Sylvester matrices
and Toeplitz matrices in the update equation (10.38) can again be efficiently
implemented by a (fast) convolution as was done in [25].

Another very popular subclass of second-order BSS algorithms, particularly
for instantaneous mixtures, is based on a cost function using the Frobenius norm
$\|\mathbf{A}\|_F^2 = \sum_{i,j} |a_{ij}|^2$ of a matrix $\mathbf{A} = (a_{ij})$, e.g., [1, 7], [12]-[15]. Analogously to
(10.39), this approach may be generalized for convolutive mixtures to

$$\mathcal{J}_F(m) = \sum_{i=0}^{\infty} \beta(i,m) \left\| \mathbf{Y}^H(i)\mathbf{Y}(i) - \mathrm{bdiag}\, \mathbf{Y}^H(i)\mathbf{Y}(i) \right\|_F^2, \qquad (10.41)$$

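which leads (after taking the natural gradient w.r.t. $\mathbf{W}$ in a similar way as in
[26]) to the following update equation:

$$\nabla^{\mathrm{NG}}_{\mathbf{W}} \mathcal{J}_F(m) = 2 \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{W}\, \mathbf{R}_{yy}(i) \begin{bmatrix} \mathbf{0} & \mathbf{R}_{y_1 y_2}(i) \\ \mathbf{R}_{y_2 y_1}(i) & \mathbf{0} \end{bmatrix}. \qquad (10.42)$$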



We see that this update equation differs from the more general equation (10.38)
mainly in the inherent normalization expressed by the inverse matrices $\mathbf{R}_{y_p y_p}^{-1}$.
Thus, (10.42) can be regarded as an analogue of the least mean square (LMS)
algorithm [35] in supervised adaptive filtering. However, many simulation
results have shown that for large filter lengths L, (10.42) is prone to instability,
while (10.38) shows a very robust convergence behaviour (see Sect. 5) even
for hundreds or thousands of filter coefficients in BSS for real acoustic
environments.
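
The effect of the inherent normalization can be made concrete for the simplest setting, P = 2 and D = 1, where all correlation blocks are scalars. The sketch below is an illustration under these assumptions only (real-valued signals, hypothetical function names, the pre-multiplication by W and the sum over blocks omitted). It compares the bracketed term of (10.38) with the corresponding term of (10.42) when both output signals are scaled by a common gain.

```python
import math

def correlations(y1, y2):
    """Short-time auto- and cross-correlations for D = 1 (real-valued signals)."""
    n = len(y1)
    r11 = sum(a * a for a in y1) / n
    r22 = sum(b * b for b in y2) / n
    r12 = sum(a * b for a, b in zip(y1, y2)) / n
    return r11, r12, r22

def normalized_term(y1, y2):
    """Matrix {R - bdiag R} bdiag^{-1} R of (10.38) for P = 2, D = 1."""
    r11, r12, r22 = correlations(y1, y2)
    return [[0.0, r12 / r22], [r12 / r11, 0.0]]

def unnormalized_term(y1, y2):
    """Matrix R {R - bdiag R} of the Frobenius-norm update (10.42) for P = 2, D = 1."""
    r11, r12, r22 = correlations(y1, y2)
    return [[r12 * r12, r11 * r12], [r22 * r12, r12 * r12]]

# Hypothetical output block and the same block with a 10x larger level.
y1 = [1.0, -2.0, 0.5, 3.0]
y2 = [0.5, 1.0, -1.0, 2.0]
loud1 = [10.0 * v for v in y1]
loud2 = [10.0 * v for v in y2]

norm_ratio = normalized_term(loud1, loud2)[0][1] / normalized_term(y1, y2)[0][1]
lms_ratio = unnormalized_term(loud1, loud2)[0][1] / unnormalized_term(y1, y2)[0][1]
```

Under these assumptions the normalized term does not change with the signal level, whereas the unnormalized term grows with the fourth power of the gain, which is one way to see why a fixed stepsize is harder to choose for (10.42).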
3. GENERIC FREQUENCY-DOMAIN BSS
ALGORITHM

Frequency-domain BSS is very popular for convolutive BSS since all tech- 
niques originally developed for instantaneous BSS can be applied indepen- 
dently in each frequency bin in the discrete Fourier transform (DFT) domain. 
Furthermore, the fast Fourier transform (FFT) can be used for an efficient im- 
plementation. Such narrowband approaches can be found, e.g., in [1], [3], [6], 
[9], [12]-[17], [22]. Unfortunately, the permutation problem, which is inher- 
ent in BSS (e.g., [1]), may then also appear independently in each frequency 
bin so that extra measures have to be taken to avoid this internal permuta- 
tion. Additionally, as discussed in Sections 2.3 and 2.4 the products involving 
Sylvester matrices in the time-domain update equations correspond to linear
convolutions. Thus, in the narrowband frequency-domain approach these con- 
volutions become circular ones. The resulting wrap-around effects may limit 
the separation performance. Based on the above matrix formulation in the time 
domain, the following derivation of broadband frequency-domain algorithms 
shows explicitly the relation between time-domain and traditional frequency- 
domain algorithms, as well as some extensions. In contrast to the narrowband 
approach this inherently resolves the permutation ambiguity and prevents circu- 
lar convolution effects in the update equation. Moreover, as in the time-domain, 
(10.17) also leads to the very desirable property of an inherent stepsize normal- 
ization in the frequency domain which also becomes clear by a link with [17] 
for the SOS case. As pointed out in the previous section, the conditions for the
parameters L, N, and D for the natural gradient adaptation are given by the
relations $N \ge D$ and $1 \le D \le L$. Therefore, we may assume N = L without
loss of generality for the following derivation.

3.1 GENERAL FREQUENCY-DOMAIN
FORMULATION

The matrix formulation (10.11) introduced for the time-domain in Sect. 2. 
allows a rigorous derivation of the corresponding frequency-domain BSS al- 
gorithms. In the frequency domain, the structure of the algorithm depends on 
the method chosen for estimating the correlation matrices. Here, we consider 
again the more accurate covariance method [40] (see Sect. 2.4.2). The matrices
$\mathbf{X}_p(m)$ and $\mathbf{W}_{pq}$ introduced in Sect. 2.1 are now diagonalized in two steps to
obtain frequency-domain representations. In the following, we mark frequency-
domain quantities by an underbar. This does not, however, imply that they are
simply DFTs of the corresponding time-domain quantities. Each quantity has
to be transformed individually. We first consider the $L \times 2L$ Toeplitz matrices
$\mathbf{X}_p(m)$.
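Step 1: Transformation of Toeplitz matrices into circulant matrices.

Any Toeplitz matrix $\mathbf{X}_p$ (10.16) can be transformed, by doubling its size, to a
circulant matrix $\mathbf{C}_{X_p}(m)$ [31]. In our case we define the $4L \times 4L$ circulant
matrix by taking into account (10.16) by

$$\mathbf{C}_{X_p}(m) = \begin{bmatrix} \mathbf{X}'_p(m-3) & \mathbf{X}_p(m-1) \\ \mathbf{X}_p(m-2) & \mathbf{X}_p(m) \\ \mathbf{X}_p(m-1) & \mathbf{X}'_p(m-3) \\ \mathbf{X}_p(m) & \mathbf{X}_p(m-2) \end{bmatrix},$$

where $\mathbf{X}'_p(m-3)$ is a properly chosen extension ensuring a circular shift of
the 4L input values in the first column. It follows

$$\mathbf{X}_p(m) = \mathbf{W}^{01_L}_{L \times 4L}\, \mathbf{C}_{X_p}(m)\, \mathbf{W}^{1_{2L}0}_{4L \times 2L}, \qquad (10.43)$$

where we introduced the windowing matrices

$$\mathbf{W}^{01_L}_{L \times 4L} = [\mathbf{0}_{L \times 3L}, \mathbf{I}_{L \times L}],$$

$$\mathbf{W}^{1_{2L}0}_{4L \times 2L} = [\mathbf{I}_{2L \times 2L}, \mathbf{0}_{2L \times 2L}]^T.$$

This notation follows the conventions listed in Sect. 2.2.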

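Step 2: Transformation of the circulant matrices into diagonal matrices.
Using the $4L \times 4L$ DFT matrix $\mathbf{F}_{4L \times 4L}$, the circulant matrices are diagonalized
as follows:

$$\mathbf{C}_{X_p}(m) = \mathbf{F}^{-1}_{4L \times 4L}\, \underline{\mathbf{X}}_p(m)\, \mathbf{F}_{4L \times 4L},$$

where the diagonal matrices $\underline{\mathbf{X}}_p(m)$, representing the frequency-domain
versions of $\mathbf{X}_p(m)$, can be expressed by the first columns of $\mathbf{C}_{X_p}(m)$,

$$\underline{\mathbf{X}}_p(m) = \mathrm{diag}\{ \mathbf{F}_{4L \times 4L} [x_p(mL-3L), \ldots, x_p(mL-1), x_p(mL), x_p(mL+1), \ldots, x_p(mL+L-1)]^T \}, \qquad (10.44)$$

i.e., to obtain $\underline{\mathbf{X}}_p(m)$, we transform the concatenated vectors of the current
block and three previous blocks of the input signals $x_p(n)$. Here, $\mathrm{diag}\{\mathbf{a}\}$
denotes a square matrix with the elements of vector $\mathbf{a}$ on its main diagonal.
Now, (10.43) can be rewritten equivalently as

$$\mathbf{X}_p(m) = \mathbf{W}^{01_L}_{L \times 4L}\, \mathbf{F}^{-1}_{4L \times 4L}\, \underline{\mathbf{X}}_p(m)\, \mathbf{F}_{4L \times 4L}\, \mathbf{W}^{1_{2L}0}_{4L \times 2L}. \qquad (10.45)$$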

Equations (10.44) and (10.45) exhibit a form that is structurally similar to that
of the corresponding counterparts of the well-known (supervised) frequency-
domain adaptive filters [31]. However, the major difference here is that we need
a transformation length of at least 4L instead of 2L for an accurate broadband
formulation. This should come as no surprise, since in BSS using the covariance
method, both convolution and correlation are carried out, and both operations
double the transformation length.
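
The diagonalization used in Step 2 can be verified on a small example. The following sketch uses only the Python standard library and hypothetical helper names; the size 8 stands in for 4L with L = 2, and an arbitrary first column replaces the concatenated input blocks. It checks that F C F^{-1} is diagonal for a circulant matrix C and that the diagonal equals the DFT of the first column of C, as stated in (10.44).

```python
import cmath

def dft_matrix(n):
    """Unnormalized n x n DFT matrix F with entries exp(-j 2 pi k l / n)."""
    return [[cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n)]
            for k in range(n)]

def inverse_dft_matrix(n):
    """Inverse of the unnormalized DFT matrix, i.e. conj(F) / n."""
    return [[cmath.exp(2j * cmath.pi * k * l / n) / n for l in range(n)]
            for k in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def circulant(first_column):
    """Circulant matrix whose columns are circular shifts of the first column."""
    n = len(first_column)
    return [[first_column[(r - c) % n] for c in range(n)] for r in range(n)]

def diagonalize(c):
    """Return F C F^{-1}; this is diagonal whenever C is circulant."""
    n = len(c)
    return matmul(matmul(dft_matrix(n), c), inverse_dft_matrix(n))

first_column = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]  # 4L = 8 samples
x_freq = diagonalize(circulant(first_column))
spectrum = [row[0] for row in
            matmul(dft_matrix(8), [[v] for v in first_column])]
```

In a practical realization the matrix products would of course be replaced by FFTs; the explicit matrices are used here only to mirror the notation of (10.44) and (10.45).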

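We now transform the matrices $\mathbf{W}_{pq}$ in the same way as shown above for
$\mathbf{X}_p$. Thereby, we obtain

$$\mathbf{W}_{pq} = \mathbf{W}^{1_{2L}0}_{2L \times 4L}\, \mathbf{F}^{-1}_{4L \times 4L}\, \underline{\mathbf{W}}_{pq}\, \mathbf{F}_{4L \times 4L}\, \mathbf{W}^{1_D 0}_{4L \times D}, \qquad (10.46)$$

where

$$\mathbf{W}^{1_D 0}_{4L \times D} = [\mathbf{I}_{D \times D}, \mathbf{0}_{D \times (4L-D)}]^T,$$

$$\mathbf{W}^{1_{2L}0}_{2L \times 4L} = [\mathbf{I}_{2L \times 2L}, \mathbf{0}_{2L \times 2L}] = (\mathbf{W}^{1_{2L}0}_{4L \times 2L})^T,$$

and the frequency-domain representation of the demixing matrix

$$\underline{\mathbf{W}}_{pq} = \mathrm{diag}\{ \mathbf{F}_{4L \times 4L} [w_{pq,0}, \ldots, w_{pq,L-1}, 0, \ldots, 0]^T \}. \qquad (10.47)$$

Equation (10.46) is illustrated in Fig. 10.4. Note that the column vector
in (10.47) corresponds to the first column of the $4L \times 4L$ matrix
$\mathbf{F}^{-1}_{4L \times 4L} \underline{\mathbf{W}}_{pq} \mathbf{F}_{4L \times 4L}$ in Fig. 10.4. Moreover, it can be seen that the
pre-multiplied transformation $\mathbf{W}^{1_{2L}0}_{2L \times 4L} \mathbf{F}^{-1}_{4L \times 4L}$ in (10.46) is related to the demixing
filter taps in the first column of $\mathbf{W}_{pq}$, while the post-multiplied transformation
in (10.46), which we denote by

$$\mathbf{L}^{1_D 0}_{4L \times D} = \mathbf{F}_{4L \times 4L}\, \mathbf{W}^{1_D 0}_{4L \times D}, \qquad (10.48)$$
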
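is related to the introduction of D time-lags (see also Sect. 3.3.1). Combining
all channels, we obtain from (10.45) and (10.46)

$$\mathbf{X}(m) = \mathbf{W}^{01_L}_{L \times 4L}\, \mathbf{F}^{-1}_{4L \times 4L}\, \underline{\mathbf{X}}(m)\, \mathrm{bdiag}\{ \mathbf{F}_{4L \times 4L} \mathbf{W}^{1_{2L}0}_{4L \times 2L}, \ldots, \mathbf{F}_{4L \times 4L} \mathbf{W}^{1_{2L}0}_{4L \times 2L} \}, \qquad (10.49)$$

$$\mathbf{W} = \mathrm{bdiag}\{ \mathbf{W}^{1_{2L}0}_{2L \times 4L} \mathbf{F}^{-1}_{4L \times 4L}, \ldots, \mathbf{W}^{1_{2L}0}_{2L \times 4L} \mathbf{F}^{-1}_{4L \times 4L} \}\, \underline{\mathbf{W}}\, \mathbf{L}, \qquad (10.50)$$

where $\underline{\mathbf{X}}(m)$ and $\underline{\mathbf{W}}$ are defined analogously to (10.13) and (10.10), respec-
tively. $\mathbf{L}$ denotes the $4LP \times DP$ matrix

$$\mathbf{L} = \mathrm{bdiag}\{ \mathbf{L}^{1_D 0}_{4L \times D}, \ldots, \mathbf{L}^{1_D 0}_{4L \times D} \}.$$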

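From (10.11), (10.49), and (10.50) we further obtain

$$\mathbf{Y}(m) = \mathbf{W}^{01_L}_{L \times 4L}\, \mathbf{F}^{-1}_{4L \times 4L}\, \underline{\mathbf{Y}}(m)\, \mathbf{L} \qquad (10.51)$$

with

$$\underline{\mathbf{Y}}(m) = \underline{\mathbf{X}}(m)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}},$$

and the time-domain constraints

$$\mathbf{G}^{1_{2L}0}_{4LP \times 4LP} = \mathrm{bdiag}\{ \mathbf{G}^{1_{2L}0}_{4L \times 4L}, \ldots, \mathbf{G}^{1_{2L}0}_{4L \times 4L} \},$$

$$\mathbf{G}^{1_{2L}0}_{4L \times 4L} = \mathbf{F}_{4L \times 4L}\, \mathbf{W}^{1_{2L}0}_{4L \times 4L}\, \mathbf{F}^{-1}_{4L \times 4L},$$

$$\mathbf{W}^{1_{2L}0}_{4L \times 4L} = \mathbf{W}^{1_{2L}0}_{4L \times 2L}\, \mathbf{W}^{1_{2L}0}_{2L \times 4L} = \begin{bmatrix} \mathbf{I}_{2L \times 2L} & \mathbf{0}_{2L \times 2L} \\ \mathbf{0}_{2L \times 2L} & \mathbf{0}_{2L \times 2L} \end{bmatrix}. \qquad (10.52)$$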



To formulate the cost function (10.17) in the frequency domain, we first need
to express it equivalently using matrices $\mathbf{Y}_p$, $p = 1, \ldots, P$. This inevitably
leads to the introduction of pdfs which depend on matrices in their arguments.
In general, such pdfs are determined by a fourth-order tensor which contains
all cross-relations between the matrix elements. However, due to the Toeplitz
structure of the matrices $\mathbf{Y}_p$ a redundancy is introduced which neither appears in
the cost function (10.17) nor leads to any improved results compared to (10.17).

Thus, we can replace the tensor by a matrix containing only the desired
information on the cross-relations between the D time-lags. This yields the
following equivalent representation of (10.17):

$$\mathcal{J}(m) = -\sum_{i=0}^{\infty} \beta(i,m) \frac{1}{N} \left\{ \log\left( \hat{p}_{1,N \times D}(\mathbf{Y}_1(i)) \cdot \ldots \cdot \hat{p}_{P,N \times D}(\mathbf{Y}_P(i)) \right) - \log\left( \hat{p}_{N \times PD}(\mathbf{Y}_1(i), \ldots, \mathbf{Y}_P(i)) \right) \right\}, \qquad (10.53)$$

with the auxiliary pdfs which we define here by

$$\hat{p}_{p,N \times D}(\mathbf{Y}_p(i)) = \prod_{j=0}^{N-1} \hat{p}_{p,D}(\mathbf{y}_p(i,j)), \qquad (10.54)$$

$$\hat{p}_{N \times PD}(\mathbf{Y}_1(i), \ldots, \mathbf{Y}_P(i)) = \prod_{j=0}^{N-1} \hat{p}_{PD}(\mathbf{y}_1(i,j), \ldots, \mathbf{y}_P(i,j)), \qquad (10.55)$$

showing the relation to the multivariate pdfs. The equivalence to (10.17) can
easily be verified by inserting (10.54) and (10.55) in (10.53). The advantage of 
introducing such auxiliary pdfs is that they can formally be handled like standard 
pdfs where the rows of the matrix in their argument are mutually statistically 
independent. This allows a compact representation of the following equations. 

To proceed with the derivation, we take the gradient of (10.53) w.r.t. the 
frequency-domain coefficient matrix W. This is done analogously to the time- 
domain derivation of (10.19). However, (10.51) and (10.52) have to be taken 
into account by using the chain rule for matrices [41]. This finally leads to the 
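following gradient for the frequency-domain update:

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m) \left\{ \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{X}}^H(i)\, \boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) - \mathbf{L} \left( \mathbf{L}^H \underline{\mathbf{W}}^H \mathbf{L} \right)^{-1} \mathbf{L}^H \right\} \qquad (10.56)$$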



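with the frequency-domain score function

$$\boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) = -\left[ \frac{\partial \hat{\underline{p}}_{1,4L \times 4L}(\underline{\mathbf{Y}}_1(i)) / \partial \underline{\mathbf{Y}}_1(i)}{\hat{\underline{p}}_{1,4L \times 4L}(\underline{\mathbf{Y}}_1(i))}, \ldots, \frac{\partial \hat{\underline{p}}_{P,4L \times 4L}(\underline{\mathbf{Y}}_P(i)) / \partial \underline{\mathbf{Y}}_P(i)}{\hat{\underline{p}}_{P,4L \times 4L}(\underline{\mathbf{Y}}_P(i))} \right]. \qquad (10.57)$$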



Note that the pdf $\hat{\underline{p}}_{q,4L \times 4L}(\cdot)$ of the frequency-domain matrix $\underline{\mathbf{Y}}_q(i)$ is
obtained by transforming the pdf $\hat{p}_{q,N \times D}(\mathbf{Y}_q(i))$ of time-domain variables
using (10.51). We will go into the precise formulation of $\hat{\underline{p}}_{q,4L \times 4L}(\cdot)$
within the scope of the special cases treated in Sect. 3.3. Equations (10.56) and
(10.57) are the generic frequency-domain counterparts of (10.19) and (10.20),
respectively, and may be equivalently used for coefficient adaptation.

As in the time domain, we need not calculate the entire coefficient matrix $\underline{\mathbf{W}}$
explicitly due to the redundancy introduced by the Sylvester structure in the time
domain, and the diagonal structure of the submatrices $\underline{\mathbf{W}}_{pq}$ in the frequency
domain, respectively. While the structure of matrix $\underline{\mathbf{W}}$ is independent of D,
matrix $\mathbf{L}$ introduces the number of time-lags taken into account by the cost
function, as shown by (10.48) and (10.51) (see also Fig. 10.4). To calculate
the separated output signals, given a demixing matrix $\underline{\mathbf{W}}$, we need to pick the
first column of $\mathbf{Y}$ in (10.51) (the other columns were introduced in (10.11)
for including multiple time-lags in the cost function). This is done by using
$\mathbf{L} = \mathbf{L}_1 = \mathrm{bdiag}\{ \mathbf{1}_{4L \times 1}, \ldots, \mathbf{1}_{4L \times 1} \}$ in (10.51). Then, $\underline{\mathbf{W}}\mathbf{L}$ in that equation
becomes a $4LP \times P$ matrix $\underline{\mathbf{W}}'$ whose columns correspond to the diagonals
of $\underline{\mathbf{W}}$. As a general rule,

$$\underline{\mathbf{W}}' = \underline{\mathbf{W}}\, \mathbf{L}_1, \qquad (10.58)$$

and building diagonal submatrices $\underline{\mathbf{W}}_{pq}$ of $\underline{\mathbf{W}}$ using the entries of $\underline{\mathbf{W}}'$
transform the two equivalent representations into each other. Thus, to formally
obtain the update of $\underline{\mathbf{W}}'$ needed for the output signal calculation, we post-multiply
(10.56) by $\mathbf{L}_1$, both simplifying the calculation of (10.56), and enforcing the
diagonal structure of $\underline{\mathbf{W}}_{pq}$ during the adaptation. This simplification results
from the fact that we only have to operate with vectors rather than matrices for
each channel when constructing the update equation from the right to the left
in a practical realization.

In addition to the diagonal structure of $\underline{\mathbf{W}}$, we have to ensure the Sylvester
structure in the time domain as noted previously. As can be seen in Fig. 10.4,
(10.47) determines the first column, and thus the whole $4L \times 4L$ Sylvester
matrix. In other words, we have to ensure that the time-domain column vector
in (10.47) contains only L filter coefficients and 3L zeros. Therefore, the
gradient (10.56) has to be constrained by $\mathbf{G}^{1_L 0}_{4LP \times 4LP}$. Together with (10.58)
this leads to

$$\Delta \underline{\mathbf{W}}'(m) = \mathbf{G}^{1_L 0}_{4LP \times 4LP}\, \nabla_{\underline{\mathbf{W}}} \mathcal{J}(m)\, \mathbf{L}_1, \qquad (10.59)$$

which may again be implemented efficiently from the right to the left. Then the
constraint $\mathbf{G}^{1_L 0}_{4LP \times 4LP}$ reduces to channel-wise inverse FFT, windowing (see
also Sect. 3.3.2), and FFT operations.
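
For a single channel, the sequence of inverse transform, windowing and forward transform can be sketched as follows. This is an illustrative sketch only, using naive DFT sums from the Python standard library in place of FFTs; the function names and the toy length 4L = 8 (L = 2) are assumptions, not part of the original text. The constrained coefficient vector corresponds to a time-domain filter with L taps followed by 3L zeros, and applying the constraint a second time changes nothing.

```python
import cmath

def dft(x):
    """Naive forward DFT of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(x):
    """Naive inverse DFT of a sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def constrain_filter(w_freq, filter_length):
    """Inverse DFT, keep the first `filter_length` taps, zero the rest, DFT."""
    w_time = idft(w_freq)
    w_time = [w if t < filter_length else 0.0 for t, w in enumerate(w_time)]
    return dft(w_time)

# Hypothetical unconstrained update for one channel, given in the time
# domain for readability (4L = 8, L = 2).
taps = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
w_unconstrained = dft(taps)
w_constrained = constrain_filter(w_unconstrained, 2)
w_time = idft(w_constrained)
w_twice = constrain_filter(w_constrained, 2)
```

Because the operation is a projection, it can be applied after every update without accumulating any additional change in coefficients that already satisfy the constraint.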

3.2 NATURAL GRADIENT IN THE FREQUENCY
DOMAIN

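In Sect. 2.3, it has been shown that the natural gradient for convolutive
mixtures introduced there for the time domain yields equivariant adaptation
algorithms, i.e., the evolutionary behaviour of

$$\mathbf{C}(m) = \mathbf{H}\mathbf{W}(m) \qquad (10.60)$$

and $\Delta\mathbf{C}(m) = \mathbf{H}\Delta\mathbf{W}(m)$ does not explicitly depend on $\mathbf{H}$ in (10.26).

In this section, we investigate how this formulation of the natural gradient
transforms into the frequency domain. To begin with, we start with the following
approach containing arbitrary matrices $\mathbf{A}_1$, $\mathbf{A}_2$, $\mathbf{A}_3$, and $\mathbf{A}_4$ of proper size:

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J} = \mathbf{A}_1 \underline{\mathbf{W}} \mathbf{A}_2 \mathbf{A}_3 \underline{\mathbf{W}}^H \mathbf{A}_4 \nabla_{\underline{\mathbf{W}}} \mathcal{J}. \qquad (10.61)$$

Now, our task is to determine the four matrices $\mathbf{A}_i$ such that the resulting
coefficient update exhibits desired properties.

As a first condition, matrix $\mathbf{A}_1 \underline{\mathbf{W}} \mathbf{A}_2 \mathbf{A}_3 \underline{\mathbf{W}}^H \mathbf{A}_4$ in (10.61) must be positive
definite, i.e., all its eigenvalues must be positive to ensure convergence [39].
This determines matrices $\mathbf{A}_3$ and $\mathbf{A}_4$ up to a positive scalar constant, which
can be absorbed in the stepsize, so that we obtain

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J} = \mathbf{A}_1 \underline{\mathbf{W}} \mathbf{A}_2 \mathbf{A}_2^H \underline{\mathbf{W}}^H \mathbf{A}_1^H \nabla_{\underline{\mathbf{W}}} \mathcal{J}. \qquad (10.62)$$

As the second, and most important condition, it is required that the equivari-
ance property is fulfilled. Combining (10.60) with (10.50), we obtain a relation
between $\mathbf{C}$ and the frequency-domain coefficients $\underline{\mathbf{W}}$,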

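$$\mathbf{C} = \mathbf{H}\, \mathrm{bdiag}\{\cdots\}\, \underline{\mathbf{W}}\, \mathbf{L}, \qquad (10.63)$$

and analogously

$$\Delta\mathbf{C} = \mathbf{H}\, \mathrm{bdiag}\{\cdots\}\, \Delta\underline{\mathbf{W}}\, \mathbf{L} = \mathbf{H}\, \mathrm{bdiag}\{\cdots\}\, \nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}\, \mathbf{L}. \qquad (10.64)$$

As in the time domain (see (10.26)), it is required that (10.64) in combination
with the natural gradient (10.62) can be expressed by $\mathbf{C}$ defined in (10.63), and
therefore does not explicitly depend on $\mathbf{H}$. This leads to the claim

$$\Delta\mathbf{C} = \underbrace{\mathbf{H}\, \mathrm{bdiag}\{\cdots\}\, \mathbf{A}_1 \underline{\mathbf{W}} \mathbf{A}_2}_{=\, \mathbf{C}}\, \mathbf{A}_2^H \underline{\mathbf{W}}^H \mathbf{A}_1^H \nabla_{\underline{\mathbf{W}}} \mathcal{J}\, \mathbf{L}, \qquad (10.65)$$

and a comparison of (10.65) with (10.63) yields the matrices

$$\mathbf{A}_1 = \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}, \qquad \mathbf{A}_2 = \mathbf{L}. \qquad (10.66)$$

Note that $\mathbf{A}_1 = \mathbf{I}$ is not the general solution. This can be verified by insert-
ing (10.66) in (10.65), and considering the argument of $\mathrm{bdiag}\{\cdot\}$ according to
(10.50).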

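Finally, we obtain the natural gradient

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J} = \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}}\, \mathbf{L} \mathbf{L}^H \underline{\mathbf{W}}^H\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \nabla_{\underline{\mathbf{W}}} \mathcal{J}, \qquad (10.67)$$

and together with (10.56) the coefficient update follows as

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}}\, \mathbf{L} \mathbf{L}^H \left\{ \underline{\mathbf{Y}}^H(i)\, \boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) - \mathbf{I} \right\}. \qquad (10.68)$$

Note that in this equation the natural gradient shows again the convenient prop-
erty of avoiding one matrix inversion. Formally, as in Sect. 3.1, (10.59) can be
used to obtain $\Delta\underline{\mathbf{W}}'$.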

3.3 SPECIAL CASES AND LINKS TO KNOWN
FREQUENCY-DOMAIN ALGORITHMS

The generic gradient (10.56) and the generic natural gradient (10.68)
exhibit three types of quantities that fully specify the practical realizations
which follow as special cases. These quantities can be related to the three
fundamental signal properties, as shown in Table 10.1.

Table 10.1 Quantities defining a certain frequency-domain algorithm.

quantity                                                  related to         examples in
constraints $\mathbf{G}^{\cdots}_{\cdots}$, $\mathbf{L}$                            nonwhiteness       Sect. 3.3.1, 3.3.2
score function $\boldsymbol{\Phi}(\cdot)$                               nongaussianity     Sect. 3.3.1, 3.3.3
weighting function $\beta(\cdot)$                             nonstationarity    Sect. 4



3.3.1 The Constraints and the Internal Permutation Problem in
Frequency-Domain BSS. Two types of constraints appear in the gradient
(10.56) and in the natural gradient update (10.68):

■ The matrices $\mathbf{G}^{\cdots}_{\cdots}$ in (10.52) and in the update equations are mainly re-
sponsible for preventing decoupling of the individual frequency compo-
nents, and thus for avoiding the internal permutation among the different
frequency bins and circular convolution effects.

■ Matrix $\mathbf{L}$ has two different functions: on the one hand, it allows joint
diagonalization over D time-lags, and on the other hand, it acts as a time-
domain constraint similar to the matrices $\mathbf{G}^{\cdots}_{\cdots}$ (see Fig. 10.4).

Note that the constraints $\mathbf{G}^{\cdots}_{\cdots}$ and $\mathbf{L}$ also appear in the score function $\boldsymbol{\Phi}(\cdot)$, as
can be seen later (e.g., Sect. 3.3.3) in more detail.

Concerning matrix $\mathbf{L}$ we can distinguish between four different cases:

a) D < L: As in the time domain, this choice allows the exploitation of the
nonwhiteness property with up to D time-lags.

b) D = L: This is the optimum case, as in the time domain.

c) D > L: This choice is not meaningful in the time domain. In the
frequency domain, however, we can choose D up to the transformation
length 4L due to the introduced circulant matrix, as shown in Fig. 10.4.
For D > L the time-domain constraint is relaxed, which may also lead
to a suboptimum solution.

d) D = 4L: According to Fig. 10.4 this corresponds to the traditional nar-
rowband approximation (apart from the constraints $\mathbf{G}^{\cdots}_{\cdots}$) so that all matrices
$\mathbf{L}$ cancel out in the update equations, which can also be verified using
(10.48).

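Case d), i.e., neglecting matrix $\mathbf{L}$ in (10.56), yields a simplified gradient

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m) \left\{ \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{X}}^H(i)\, \boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) - \underline{\mathbf{W}}^{-H} \right\}, \qquad (10.69)$$

where $(\cdot)^{-H}$ denotes the inverse of the conjugate transpose of a matrix, and from
(10.68), we obtain a simplified natural gradient

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}} \left\{ \underline{\mathbf{Y}}^H(i)\, \boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) - \mathbf{I} \right\}. \qquad (10.70)$$

Note that these expressions still largely avoid the well-known internal per-
mutation problem of frequency-domain BSS using the constraints $\mathbf{G}^{\cdots}_{\cdots}$ in the
calculation of $\underline{\mathbf{Y}}$ in (10.52) and in the update equations obtained by inserting
(10.69) or (10.70) in (10.59).

Figure 10.5 Illustration of bin-wise decomposition for the 2-channel case.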

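By additionally approximating $\mathbf{G}^{\cdots}_{\cdots}$ as scaled identity matrices [31] in the
gradients, the submatrices of $\underline{\mathbf{Y}}$ in (10.69) and (10.70) also become diagonal,
as illustrated in Fig. 10.5. Moreover, the frequency-domain multivariate score
function $\boldsymbol{\Phi}(\cdot)$ can be decomposed into frequency bin selective score functions
containing only univariate pdfs for channel p, i.e.,

$$\boldsymbol{\Phi}^{(\nu)}(\underline{\mathbf{Y}}^{(\nu)}(i)) = -\left[ \frac{\partial \hat{\underline{p}}_1(\underline{Y}_1^{(\nu)}(i)) / \partial \underline{Y}_1^{(\nu)}(i)}{\hat{\underline{p}}_1(\underline{Y}_1^{(\nu)}(i))}, \ldots, \frac{\partial \hat{\underline{p}}_P(\underline{Y}_P^{(\nu)}(i)) / \partial \underline{Y}_P^{(\nu)}(i)}{\hat{\underline{p}}_P(\underline{Y}_P^{(\nu)}(i))} \right], \qquad (10.71)$$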



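where $\nu = 0, \ldots, 4L-1$ denotes the frequency bin index. This approximation
combined with D = 4L (case d) from above) corresponds to the traditional
narrowband approach. Only in this case can both equations be decomposed
into their frequency components, i.e., we can equivalently write

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}^{(\nu)}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m) \left\{ \left( \underline{\mathbf{X}}^{(\nu)}(i) \right)^H \boldsymbol{\Phi}^{(\nu)}(\underline{\mathbf{Y}}^{(\nu)}(i)) - \left( \underline{\mathbf{W}}^{(\nu)} \right)^{-H} \right\}, \qquad (10.72)$$

and

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}^{(\nu)}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \underline{\mathbf{W}}^{(\nu)} \left\{ \left( \underline{\mathbf{Y}}^{(\nu)}(i) \right)^H \boldsymbol{\Phi}^{(\nu)}(\underline{\mathbf{Y}}^{(\nu)}(i)) - \mathbf{I} \right\}, \qquad (10.73)$$

respectively. In contrast to $\underline{\mathbf{W}}$ and $\underline{\mathbf{Y}}$ in (10.56), (10.68), which are $4LP \times 4LP$
and $4L \times 4LP$ matrices, respectively, the corresponding matrices $\underline{\mathbf{W}}^{(\nu)}$ and
$\underline{\mathbf{Y}}^{(\nu)}$ in (10.72), (10.73) are only of dimensions $P \times P$ and $1 \times P$, respectively.

The approximation (10.73) of the natural gradient corresponds to the ICA
narrowband approach originally proposed by Smaragdis [22] as an extension
of the information maximization approach [23].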

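Note that the nonholonomic version of the natural gradient (10.73) can
be obtained similarly to the time domain by replacing matrix $\mathbf{I}$ with
$\mathrm{diag}\{ (\underline{\mathbf{Y}}^{(\nu)}(i))^H \boldsymbol{\Phi}^{(\nu)}(\underline{\mathbf{Y}}^{(\nu)}(i)) \}$.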

To derive the update equations from the approximated gradients, we apply
again (10.59), which contains another constraint $\mathbf{G}^{1_L 0}_{4LP \times 4LP}$ transforming
the filter coefficients back into the time domain, zeroing the last 3L values, and
transforming the result back to the frequency domain. Thus, even if (10.72) and
(10.73) can be efficiently computed in a bin-selective manner, this constraint 
prevents a complete decoupling of the frequency-components in the update 
equations. This procedure appears similarly in the well-known “constrained 
frequency-domain adaptive filtering” in the supervised case [35], [31]. In BSS, 
this theoretically founded mechanism largely eliminates the internal permuta- 
tion problem in a simple way. It was first heuristically introduced in [22], and 
also in [14]. A more detailed experimental examination on this constraint was 
reported in [42] confirming that the ratio between filter length L and transfor- 
mation length 4L - as obtained here analytically - yields optimum separation 
performance. However, due to the omission of the other constraints in the ap- 
proximated gradients we will not perfectly remove the permutation ambiguity 
as observed experimentally in [42]. Traditional narrowband approaches also 
neglecting the time-domain constraint in (10.59) need additional measures for 
solving the permutation problem (e.g., [13], [43]). 

3.3.2 Alternative Approximations of the Constraints. The generic al-
gorithm (10.68) with its constraint matrices $\mathbf{G}^{\cdots}_{\cdots}$ suggests alternative
realizations that allow improved tradeoffs between the exact broadband ap-
proach (large computational complexity) and the narrowband approach (inter-
nal permutation ambiguity) by choosing certain efficient approximations of the
constraints.

Generally, we can distinguish between approximations depending on the
block index and approximations within each block. One example for the former
class is to simply apply the constraints periodically for a reduced number of
blocks, which has also been proposed for the supervised case [44].

Figure 10.6 Illustration of a smoothed window function for L = 512, i.e., transformation
length 2048: a) time-domain window function; b) frequency-domain representation (real
and imaginary parts, DFT length 2048, plotted over the frequency bins). Note that the
window functions are circular.

The other class is based on efficient approximations of the rectangular win-
dow appearing in the constraints. This is done by smoothing the rectangular
window (Fig. 10.6a) so that its frequency-domain representation can be well
described by a small number of coefficients (Fig. 10.6b). Having such a repre-
sentation, it is often more efficient to directly apply the convolution operation
in the frequency domain instead of going back and forth between the time do-
main and frequency domain. This general idea has been discussed earlier for
supervised adaptive filtering [45], especially after the introduction of the super-
vised generic frequency-domain framework [46, 31], see, e.g., [47, 48]. Several
variations are possible for designing the smoothed window (see also filter
design techniques [49]). However, the smoothed window has to be flat within
the length L (e.g., Tukey window [49]). Otherwise compensation terms are
necessary [48]. In BSS, a similar windowing has been proposed heuristically
in [50].

3.3.3 Generic Frequency-Domain BSS Based on SOS. As shown for
the time domain, we derive a generic SOS algorithm by considering Gaus-
sian pdfs. The corresponding Gaussian auxiliary pdf for matrices in the sense
described above is obtained using (10.54). It follows

$$\hat{p}_{p,N \times D}(\mathbf{Y}_p(i)) = \frac{1}{\left( (2\pi)^D \det \mathbf{R}_{y_p y_p}(i) \right)^{N/2}}\, e^{-\frac{1}{2} \mathrm{tr}\left\{ \mathbf{Y}_p(i)\, \mathbf{R}_{y_p y_p}^{-1}(i)\, \mathbf{Y}_p^H(i) \right\}}. \qquad (10.74)$$

Transforming this Gaussian pdf into the pdf for the corresponding frequency 
domain variables Yp gives again a Gaussian. Using (10.51) and (10.52) we 



obtain 
The resulting score function (10.57) reads

$$\boldsymbol{\Phi}(\underline{\mathbf{Y}}(i)) = \cdots \, \mathrm{bdiag}^{-1}\left( \mathbf{L}^H \underline{\mathbf{S}}_{yy}(i) \mathbf{L} \right) \mathbf{L}^H. \qquad (10.78)$$


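This leads to

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{S}}_{xy} \mathbf{L} \left( \mathbf{L}^H \underline{\mathbf{S}}_{yy} \mathbf{L} \right)^{-1} \mathbf{L}^H \left\{ \underline{\mathbf{S}}_{yy} - \mathrm{bdiag}\, \underline{\mathbf{S}}_{yy} \right\} \mathbf{L}\, \mathrm{bdiag}^{-1}\left( \mathbf{L}^H \underline{\mathbf{S}}_{yy} \mathbf{L} \right) \mathbf{L}^H, \qquad (10.79)$$

where

$$\underline{\mathbf{S}}_{xy}(i) = \underline{\mathbf{S}}_{xx}(i)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}}. \qquad (10.80)$$

Finally, with (10.67), we obtain the natural gradient

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{G}^{1_{2L}0}_{4LP \times 4LP}\, \underline{\mathbf{W}}\, \mathbf{L} \mathbf{L}^H \left\{ \underline{\mathbf{S}}_{yy} - \mathrm{bdiag}\, \underline{\mathbf{S}}_{yy} \right\} \mathbf{L}\, \mathrm{bdiag}^{-1}\left( \mathbf{L}^H \underline{\mathbf{S}}_{yy} \mathbf{L} \right) \mathbf{L}^H. \qquad (10.81)$$

Equations (10.79) and (10.81) are the SOS analogues of (10.56) and (10.68). In
the same way as shown here for the Gaussian case, we could also analogously
define auxiliary pdfs for SIRPs (see Sect. 2.4). Note that in (10.81) the natural
gradient shows again the convenient property of avoiding one matrix inversion.
Formally, as in Sect. 3.1, (10.59) can be used to obtain $\Delta\underline{\mathbf{W}}'$.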



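3.3.4 Approximation of the Generic Frequency-Domain BSS Based
on SOS. In the SOS case we can apply the same approximation steps as
discussed for the HOS case in Sect. 3.3.1. By analogously neglecting matrix $\mathbf{L}$
in (10.79) and (10.81) we obtain a simplified gradient and a simplified natural
gradient, respectively, which still largely avoid the internal permutation problem
of frequency-domain BSS.

The narrowband approach is obtained by additionally approximating $\mathbf{G}^{\cdots}_{\cdots}$ as
scaled identity matrices [31], yielding gradients which can be decomposed into
their frequency components, i.e.,

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}^{(\nu)}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \underline{\mathbf{S}}^{(\nu)}_{xy} \left( \underline{\mathbf{S}}^{(\nu)}_{yy} \right)^{-1} \left\{ \underline{\mathbf{S}}^{(\nu)}_{yy} - \mathrm{diag}\, \underline{\mathbf{S}}^{(\nu)}_{yy} \right\} \mathrm{diag}^{-1}\, \underline{\mathbf{S}}^{(\nu)}_{yy} \qquad (10.82)$$

and

$$\nabla^{\mathrm{NG}}_{\underline{\mathbf{W}}} \mathcal{J}^{(\nu)}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \underline{\mathbf{W}}^{(\nu)} \left\{ \underline{\mathbf{S}}^{(\nu)}_{yy} - \mathrm{diag}\, \underline{\mathbf{S}}^{(\nu)}_{yy} \right\} \mathrm{diag}^{-1}\, \underline{\mathbf{S}}^{(\nu)}_{yy}, \qquad (10.83)$$

respectively, where $\nu = 0, \ldots, 4L-1$ denotes the frequency bins. In contrast
to $\underline{\mathbf{S}}_{xy}$, $\underline{\mathbf{S}}_{yy}$, and $\underline{\mathbf{W}}$ in (10.79), (10.81), which are $4LP \times 4LP$ matrices each,
the corresponding matrices $\underline{\mathbf{S}}^{(\nu)}_{xy}$, $\underline{\mathbf{S}}^{(\nu)}_{yy}$, and $\underline{\mathbf{W}}^{(\nu)}$ in (10.82), (10.83) are only
of dimension $P \times P$.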

To obtain the update equations from the approximated gradients, we apply
again (10.59), preventing the complete decoupling by the constraint $\mathbf{G}^{1_L 0}_{4LP \times 4LP}$.
The approximated coefficient update (10.82) is directly related to some well-
known frequency-domain BSS algorithms. In [16], an algorithm that is similar
to (10.82) was derived by directly optimizing a cost function similar to the one
in [10] in a bin-wise manner. More recently, Fancourt and Parra proposed in
[17] to apply the magnitude-squared coherence

$$\left| \gamma^{(\nu)}_{y_1 y_2}(m) \right|^2 = \frac{\left| S^{(\nu)}_{y_1 y_2}(m) \right|^2}{S^{(\nu)}_{y_1 y_1}(m)\, S^{(\nu)}_{y_2 y_2}(m)} \qquad (10.84)$$

as a cost function for frequency-domain BSS, where $S^{(\nu)}_{y_p y_q}(m)$, $p, q \in \{1, 2\}$,
denotes the $(p,q)$-th element of $\underline{\mathbf{S}}^{(\nu)}_{yy}(m)$, i.e., the power spectral density in the
$\nu$-th bin and block m. The coherence (10.84) has the very desirable property
that

$$0 \le \left| \gamma^{(\nu)}_{y_1 y_2}(m) \right|^2 \le 1, \qquad (10.85)$$

which directly translates into an inherent stepsize normalization of the corre-
sponding update equation [17]. In particular, $|\gamma^{(\nu)}_{y_1 y_2}(m)|^2 = 0$ if $y_1$ and $y_2$ are
orthogonal, and $|\gamma^{(\nu)}_{y_1 y_2}(m)|^2 = 1$ when $y_1 = a y_2$ for any non-zero complex
number a.
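
The two limiting cases of (10.85) are easy to reproduce numerically. The sketch below is an illustration only: it assumes that the spectral densities of one frequency bin are estimated by averaging over a few blocks, and the function name `msc` and the sample values are hypothetical.

```python
def msc(y1_bins, y2_bins):
    """Magnitude-squared coherence of one frequency bin.

    y1_bins, y2_bins: complex DFT values of the two outputs in that bin,
    one value per block; the (cross-)spectra are estimated by summation.
    """
    s12 = sum(a * b.conjugate() for a, b in zip(y1_bins, y2_bins))
    s11 = sum(abs(a) ** 2 for a in y1_bins)
    s22 = sum(abs(b) ** 2 for b in y2_bins)
    return abs(s12) ** 2 / (s11 * s22)

y2 = [1 + 1j, 2 - 1j, -1 + 0.5j]

# Fully coherent outputs: y1 = a * y2 with a non-zero complex factor a.
a = 0.5 - 2j
coherent = msc([a * v for v in y2], y2)

# Orthogonal outputs: the cross-spectrum sums to zero.
orthogonal = msc([1 + 0j, 1 + 0j], [1 + 0j, -1 + 0j])

# Generic case: some value between the two extremes.
generic = msc([1 + 0j, 0.5j, -1 + 1j], y2)
```

Because the value is bounded independently of the signal levels, a gradient derived from it does not need an additional power normalization of the stepsize.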

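Comparing the update equation (10.82) with that derived in [17], we see that
an additional approximation of $(\underline{\mathbf{S}}^{(\nu)}_{yy})^{-1}$ as a diagonal matrix was used in [17],
which results in

$$\nabla_{\underline{\mathbf{W}}} \mathcal{J}^{(\nu)}(m) = \frac{2}{N} \sum_{i=0}^{\infty} \beta(i,m)\, \underline{\mathbf{S}}^{(\nu)}_{xy}\, \mathrm{diag}^{-1}\, \underline{\mathbf{S}}^{(\nu)}_{yy} \left\{ \underline{\mathbf{S}}^{(\nu)}_{yy} - \mathrm{diag}\, \underline{\mathbf{S}}^{(\nu)}_{yy} \right\} \mathrm{diag}^{-1}\, \underline{\mathbf{S}}^{(\nu)}_{yy}. \qquad (10.86)$$

The coherence function (10.84) applied in [17] can be extended to the case
P > 2 by using the so-called generalized coherence [32]. In [26] a link between
the SOS cost function (10.39) and the generalized coherence was established.
This relationship allows a geometric interpretation of (10.39) and shows that
this cost function leads to an inherent stepsize normalization for the coefficient
updates.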

4. WEIGHTING FUNCTION 

In the generalized cost functions (10.17) and (10.39) a weighting function
$\beta(i,m)$ was introduced with the block time indices i, m to allow different re-
alizations of the algorithms. Based on the cost function we previously derived
stochastic and natural gradient update equations in the time domain and fre-
quency domain. Due to the similar structure of these equations, we will now
consider only the time domain for simplicity. There, we can express the coef-
ficient update as

$$\Delta\mathbf{W}(m) = \sum_{i=0}^{\infty} \beta(i,m)\, \mathbf{Q}(i), \qquad (10.87)$$

where $\mathbf{Q}(i)$ denotes the term originating from the i-th block. In the following
we distinguish three different types of weighting functions $\beta(i,m)$ for off-line,
on-line, and block-on-line realizations [28]. The weighting functions have a
finite support, and are normalized such that $\sum_{i=0}^{m} \beta(i,m) = 1$.

Figure 10.8 Weighting function $\beta(i,m)$ for on-line implementation.

4.1 OFF-LINE IMPLEMENTATION 

When the algorithm is realized as an off-line (or so-called batch) algorithm,
$\beta(i,m)$ corresponds to a rectangular window (Fig. 10.7), which is described by
a window function $\epsilon_{a,b}(i)$, where $\epsilon_{a,b}(i) = 1$ for $a \le i \le b$, and $\epsilon_{a,b}(i) = 0$
else. The entire signal is segmented into blocks, and then the entire signal is
processed to estimate the demixing matrix $\mathbf{W}^{\ell}$, where the superscript $\ell$ denotes
the current iteration of the coefficient update.

Hence, the algorithm generally visits the signal data repeatedly, once for each
iteration $\ell$, and therefore it usually achieves a better performance than
its on-line counterpart.

4.2 ON-LINE IMPLEMENTATION 

In time-varying environments an on-line implementation of (10.87) is
required. An efficient realization can be achieved by using a weighting function
with an exponential forgetting factor $\lambda$ (Fig. 10.8). It is defined by

$$\beta(i,m) = (1 - \lambda)\, \lambda^{m-i}\, \epsilon_{0,m}(i), \tag{10.89}$$

where $0 \le \lambda < 1$. Thus (10.87) reads

$$\Delta W(m) = (1 - \lambda) \sum_{i=0}^{m} \lambda^{m-i}\, Q(i), \tag{10.90}$$

Figure 10.9 Weighting function $\beta(i,m)$ for block-on-line implementation. Note that $m'$
denotes the new block index.

where $m$ denotes the current block. Additionally, (10.90) can be formulated
recursively to reduce computational complexity and memory requirements since
only the preceding demixing matrix has to be saved for the update. This leads
to the following coefficient update to be used in (10.23):

$$\Delta W(m) = \lambda\, \Delta W(m-1) + (1 - \lambda)\, Q(m). \tag{10.91}$$

For the special case $\lambda = 0$ we have

$$W(m) = W(m-1) - \mu\, Q(m), \tag{10.92}$$

which corresponds to $\beta(i,m) = \delta(i - m)$.
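The equivalence between the exponentially weighted sum (10.90) and the recursion (10.91) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the update terms $Q(i)$ are replaced by random 2x2 matrices, the function names are hypothetical, and the recursion is assumed to start from $\Delta W(-1) = 0$.

```python
import numpy as np

def online_update_direct(Q_blocks, lam):
    """Evaluate (10.90): (1 - lam) * sum_{i=0}^{m} lam**(m - i) * Q(i)."""
    m = len(Q_blocks) - 1
    weights = (1.0 - lam) * lam ** (m - np.arange(m + 1))
    # Contract the block axis: weighted sum of the Q(i) matrices.
    return np.tensordot(weights, np.asarray(Q_blocks), axes=1)

def online_update_recursive(Q_blocks, lam):
    """Evaluate (10.91) block by block, assuming Delta W(-1) = 0."""
    delta_w = np.zeros_like(Q_blocks[0])
    for Q in Q_blocks:
        delta_w = lam * delta_w + (1.0 - lam) * Q
    return delta_w

rng = np.random.default_rng(0)
Q_blocks = rng.standard_normal((20, 2, 2))  # toy stand-ins for Q(0), ..., Q(19)
direct = online_update_direct(Q_blocks, lam=0.9)
recursive = online_update_recursive(Q_blocks, lam=0.9)
print(np.allclose(direct, recursive))
```

The recursive form needs only the previous update in memory, whereas the direct sum needs all past blocks. With `lam=0.0` the recursion returns the most recent block term only, which is the special case leading to (10.92).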

4.3 BLOCK-ON-LINE IMPLEMENTATION 

The on-line and off-line approaches can be combined in a so-called
block-on-line method (Fig. 10.9), which has been applied for BSS, e.g., in [51]. After
obtaining $K$ blocks of length $N$ we process an off-line algorithm with $\ell_{\max}$
iterations. The demixing filter matrix $W(m')$ of the current block $m'$ is then
used as initial value for the off-line algorithm of the next block. This
block-on-line approach allows a tradeoff between computational complexity on the one
hand and separation performance and speed of convergence on the other hand
by adjusting the maximum number of iterations $\ell_{\max}$, as we will see in Sect. 5.

5. EXPERIMENTS AND RESULTS 

Experiments have been conducted using speech data convolved with impulse
responses of a real room (580 cm × 590 cm × 310 cm) with a reverberation time
$T_{60} = 150$ ms and a sampling frequency of 16 kHz. A two-element microphone
array with an inter-element spacing of 16 cm was used. The speech signals
arrived from two different directions, −45° and 45°. The distance between the
speakers and the microphones was 2.0 m. The length of the source signals
(two male speakers from the TIMIT speech corpus [52]) was 10 seconds. The
performance was evaluated by means of the signal-to-interference ratio (SIR),
defined as the ratio of the signal power of the target signal to the signal power
of the jammer signal. For off-line implementations the SIR was calculated
over the entire signal length, whereas for on-line implementations it was
continuously calculated for each block. In the following the SIR is averaged over
both channels.

Figure 10.10 Comparison of different off-line realizations (left: SOS algorithms, right: SOS
vs. HOS).
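The SIR definition above translates directly into code. This is a minimal sketch under the assumption that the target and jammer contributions at one output channel are available as separate signals (as they are in a simulation with known sources). The function names are hypothetical, and how the per-channel values are averaged is not specified in the text.

```python
import numpy as np

def sir_db(target, jammer):
    """SIR in dB: power of the target component over power of the jammer component."""
    p_target = np.mean(np.square(target))
    p_jammer = np.mean(np.square(jammer))
    return 10.0 * np.log10(p_target / p_jammer)

def blockwise_sir_db(target, jammer, block_len):
    """SIR per block, as would be tracked for an on-line implementation."""
    num_blocks = len(target) // block_len
    return np.array([
        sir_db(target[b * block_len:(b + 1) * block_len],
               jammer[b * block_len:(b + 1) * block_len])
        for b in range(num_blocks)
    ])

# Toy signals: the jammer has one tenth of the target amplitude.
target = np.ones(16)
jammer = 0.1 * np.ones(16)
overall = sir_db(target, jammer)                  # off-line: entire signal
per_block = blockwise_sir_db(target, jammer, 8)   # on-line: one value per block
```

An amplitude ratio of 10 corresponds to a power ratio of 100, so both the overall and the blockwise values come out at 20 dB for this toy input.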

In our experiments we compared off-line and on-line realizations and we
examined the effect of taking into account different numbers of time-lags $D$
for the computation of the correlation function in (10.28) and (10.37). In all
experiments the unmixing filter length was set to $L = 512$, the number of lags
to $D = 512$, and the block length to $N = 1024$, respectively. Note that the
stepsizes of all algorithms have been maximized up to the stability margins.

The framework developed here also allows a better understanding of the
initialization of $W$. It can be shown using (10.6) and (10.25) that the first
coefficient of each filter $W_{pp}$ must be nonzero. This is ensured by using unit
impulses for the first filter tap in each $W_{pp}$. The filters $W_{pq}$, $p \neq q$, are set to
zero.
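The initialization described above can be sketched as follows. The array layout (`W[p, q, :]` holding the FIR filter from input channel `p` to output channel `q`) and the function name are assumptions made for illustration. The chapter's own matrix formulation of $W$ is not reproduced here.

```python
import numpy as np

def init_demixing_filters(num_channels, filter_len):
    """Unit impulse in the first tap of each W_pp, all cross filters W_pq zero."""
    W = np.zeros((num_channels, num_channels, filter_len))
    for p in range(num_channels):
        W[p, p, 0] = 1.0  # first coefficient of each direct-path filter is nonzero
    return W

# Two channels and L = 512 taps, as in the experiments.
W = init_demixing_filters(num_channels=2, filter_len=512)
```

With this starting point the demixing system initially passes each sensor signal through unchanged.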

In the left plot of Fig. 10.10 different off-line SOS algorithms are shown. It 
can be seen that the approximated gradient (10.82) (dashed) and natural gradient 
(10.83) (solid) versions of the generic SOS frequency-domain algorithm exhibit 
the best convergence. This is mainly due to the decomposition of the update
equation into its frequency components, so that we have an independent update
in each frequency bin. The complete decoupling, and therefore also the internal
permutation problem, is prevented by considering the constraint
(10.59) (see Sect. 3.1).

It should be pointed out that the generic SOS time-domain algorithm (10.36)
(dotted) achieves almost the same convergence as the frequency-domain
algorithms. This shows that time-domain algorithms can also exhibit a stable and
robust convergence behaviour for long unmixing filters. However, in the generic
SOS time-domain algorithm this comes with an increased computational cost,
as an inversion of a large matrix is required due to the RLS-like normalization
(see Sect. 2.4.2). The approximated version of the generic SOS time-domain
algorithm (dash-dotted) according to (10.40) shows a slower convergence as
the RLS-like normalization is replaced by a diagonal matrix which corresponds
to an NLMS-like normalization. Moreover it can be seen that all curves
converge to the same maximum SIR value which does not depend on the choice of
adaptation algorithm.

Figure 10.11 Effect of exploiting nonwhiteness by taking into account different numbers of
lags $D$ ($L = 512$).

In the right plot of Fig. 10.10 we compared the generic SOS algorithm in 
the time-domain and the generic HOS algorithm with the SIRP model from the 
Laplacian pdf (10.33). Note that the argument of the modified Bessel functions
$K_{D/2+1}(\cdot)$ in (10.33) has to be properly regularized. The additional gain in
convergence speed of HOS over SOS is due to the additional exploitation of 
nongaussianity. 

In Fig. 10.11 the dependency of the SIR on the number of lags $D$ used
for the computation of the correlation function $R_{yy}$ in the SOS algorithms is
illustrated. An off-line version of the approximated time-domain algorithm
(10.40) was evaluated after 50 iterations. We observe a steep increase of the
achievable separation performance for up to 8 lags. This can be explained by
the fact that speech is strongly correlated within the first lags. By considering
these temporal correlations, i.e., nonwhiteness, additional information about the
mixtures is taken into account for the simultaneous diagonalization of $R_{yy}$. A
further increase of $D$ still improves the SIR slightly as the temporal correlation
of the room impulse response is considered in the adaptation.

Figure 10.12 Comparison of different on-line realizations.



Various on-line realizations of SOS algorithms are shown in Fig. 10.12. 
Obviously, the frequency-domain algorithm (dashed) exhibits superior conver- 
gence compared to the time-domain algorithm (dash-dotted) due to the NLMS- 
like approximation of the normalization in the time domain. However, it can 
also be seen that this effect can be mitigated by using a block-on-line adaptation 
(see 4.3) (solid) with K - S, N - 512, and £max = 10 iterations. This leads to 
improved convergence and separation performance at the expense of increased 
computational cost. 

6. CONCLUSIONS 

We presented a unified treatment of BSS algorithms for convolutive mix- 
tures. This framework contains two main principles: Firstly, three fundamental 
signal properties, nonwhiteness, nonstationarity, and nongaussianity are explic- 
itly taken into account in the generic cost function. Secondly, the framework is 
based on a general broadband formulation and optimization of this cost function. 
Due to this approach, rigorous derivations of both known and novel algorithms 
in the time and frequency domain became possible. Moreover, the introduced 
matrix formulation with the resulting constraints provides a deeper understand- 
ing of the internal permutation ambiguity appearing in traditional narrowband 
frequency-domain BSS. Experimental results confirm the theoretical findings
and demonstrate that this approach allows BSS in both the time and frequency
domains for reverberant acoustic environments.






References 

[1] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley & Sons, 
Inc., New York, 2001. 

[2] M. Zibulevsky and B.A. Pearlmutter, “Blind source separation by sparse decomposition 
in a signal dictionary,” Neural Computation, vol. 13, pp. 863-882, 2001. 

[3] S. Araki, S. Makino, A. Blin, R. Mukai, and H. Sawada, “Blind Separation of More 
Speech than Sensors with Less Distortion by Combining Sparseness and ICA,” in Proc. 
Int. Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, 
pp. 271-274. 

[4] J.-F. Cardoso and A. Souloumiac, “Blind beamforming for non gaussian signals,” IEE
Proceedings-F, vol. 140, no. 6, pp. 362-370, Dec. 1993.

[5] W. Herbordt and W. Kellermann, “Adaptive beamforming for audio signal acquisition,” in
Adaptive signal processing: Application to real-world problems, J. Benesty and Y. Huang,
Eds., pp. 155-194, Springer, Berlin, Jan. 2003.

[6] E. Weinstein, M. Feder, and A. Oppenheim, “Multi-channel signal separation by
decorrelation,” IEEE Trans. on Speech and Audio Processing, vol. 1, no. 4, pp. 405-413, Oct.
1993.

[7] L. Molgedey and H.G. Schuster, “Separation of a mixture of independent signals using 
time delayed correlations,” Physical Review Letters, vol. 72, pp. 3634-3636, 1994. 

[8] L. Tong, R.-W. Liu, V.C. Soon, and Y.-F. Huang, “Indeterminacy and identifiability of
blind identification,” IEEE Trans. on Circuits and Systems, vol. 38, pp. 499-509, 1991.

[9] S. Van Gerven and D. Van Compernolle, “Signal separation by symmetric adaptive decor- 
relation: stability, convergence, and uniqueness,” IEEE Trans. Signal Processing, vol. 43, 
no. 7, pp. 1602-1612, 1995. 

[10] K. Matsuoka, M. Ohya, and M. Kawamoto, “A neural net for blind separation of
nonstationary signals,” Neural Networks, vol. 8, no. 3, pp. 411-419, 1995.

[11] M. Kawamoto, K. Matsuoka, and N. Ohnishi, “A method of blind separation for convolved 
non-stationary signals,” Neurocomputing, vol. 22, pp. 157-171, 1998. 

[12] J.-F. Cardoso and A. Souloumiac, “Jacobi angles for simultaneous diagonalization,” SIAM
J. Mat. Anal. Appl., vol. 17, no. 1, pp. 161-164, Jan. 1996.

[13] S. Ikeda and N. Murata, “An approach to blind source separation of speech signals,” Proc. 
Int. Symposium on Nonlinear Theory and its Applications, Crans-Montana, Switzerland, 
1998. 

[14] L. Parra and C. Spence, “Convolutive blind source separation of non-stationary sources,” 
IEEE Trans. Speech and Audio Processing, pp. 320-327, May 2000. 

[15] D.W.E. Schobben and P.C.W. Sommen, “A frequency-domain blind signal separation
method based on decorrelation,” IEEE Trans. on Signal Processing, vol. 50, no. 8, pp.
1855-1865, Aug. 2002.

[16] H.-C. Wu and J.C. Principe, “Simultaneous diagonalization in the frequency domain 
(SDIF) for source separation,” in Proc. IEEE Int. Symposium on Independent Compo- 
nent Analysis and Blind Signal Separation (ICA), 1999, pp. 245-250. 

[17] C.L. Fancourt and L. Parra, “The coherence function in blind source separation of convo- 
lutive mixtures of non-stationary signals,” in Proc. Int. Workshop on Neural Networks for 
Signal Processing (NNSP), 2001. 







[18] P. Comon, “Independent component analysis, a new concept?” Signal Processing, vol.
36, no. 3, pp. 287-314, Apr. 1994.

[19] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, Wiley & Sons, 
Ltd., Chichester, UK, 2002. 

[20] S. Amari, A. Cichocki, and H.H. Yang, “A new learning algorithm for blind signal sepa- 
ration,” in Advances in neural information processing systems, 8, Cambridge, MA, MIT 
Press, 1996, pp. 757-763. 

[21] J.-F. Cardoso, “Blind signal separation: Statistical principles,” Proc. IEEE, vol. 86, pp. 
2009-2025, Oct. 1998. 

[22] P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Neuro- 
computing, vol. 22, pp. 21-34, July 1998. 

[23] A.J. Bell and T.J. Sejnowski, “An information-maximisation approach to blind separation
and blind deconvolution,” Neural Computation, vol. 7, pp. 1129-1159, 1995.

[24] T. Nishikawa, H. Saruwatari, and K. Shikano, “Comparison of time-domain ICA, 
frequency-domain ICA and multistage ICA for blind source separation,” in Proc. Eu- 
ropean Signal Processing Conference (EUSIPCO), Sep. 2002, vol. 2, pp. 15-18. 

[25] R. Aichner, S. Araki, S. Makino, T. Nishikawa, and H. Saruwatari, “Time-domain blind 
source separation of non-stationary convolved signals with utilization of geometric beam- 
forming,” in Proc. Int. Workshop on Neural Networks for Signal Processing (NNSP), 
Martigny, Switzerland, 2002, pp. 445-454. 

[26] H. Buchner, R. Aichner, and W. Kellermann, “A generalization of a class of blind source 
separation algorithms for convolutive mixtures,” Proc. IEEE Int. Symposium on Indepen- 
dent Component Analysis and Blind Signal Separation (ICA), Nara, Japan, Apr. 2003, pp. 
945-950. 

[27] H. Buchner, R. Aichner, and W. Kellermann, “Blind Source Separation for Convolu- 
tive Mixtures Exploiting Nongaussianity, Nonwhiteness, and Nonstationarity,” Proc. Int. 
Workshop on Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, September 2003. 

[28] R. Aichner, H. Buchner, S. Araki, and S. Makino, “On-line time-domain blind source
separation of nonstationary convolved signals,” Proc. IEEE Int. Symposium on
Independent Component Analysis and Blind Signal Separation (ICA), Nara, Japan, Apr. 2003, pp.
987-992.

[29] E. Moulines, O. Ait Amrane, and Y. Grenier, “The generalized multidelay adaptive filter: 
structure and convergence analysis,” IEEE Trans. Signal Processing, vol. 43, pp. 14-28, 
Jan. 1995. 

[30] H. Brehm and W. Stammler, “Description and generation of spherically invariant
speech-model signals,” Signal Processing, vol. 12, pp. 119-141, 1987.

[31] H. Buchner, J. Benesty, and W. Kellermann, “Multichannel Frequency-Domain Adaptive
Algorithms with Application to Acoustic Echo Cancellation,” in J. Benesty and Y. Huang
(eds.), Adaptive signal processing: Application to real-world problems, Springer-Verlag,
Berlin/Heidelberg, Jan. 2003.

[32] H. Gish and D. Cochran, “Generalized Coherence,” Proc. IEEE Int. Conf. on Acoustics, 
Speech, and Signal Processing (ICASSP), New York, NY, USA, 1988, pp. 2745-2748. 

[33] H.H. Yang and S. Amari, “Adaptive online learning algorithms for blind separation:
maximum entropy and minimum mutual information,” Neural Computation, vol. 9, pp.
1457-1482, 1997.







[34] T.M. Cover and J.A. Thomas, Elements of Information Theory, Wiley & Sons, New York, 
1991. 

[35] S. Haykin, Adaptive Filter Theory, 3rd ed., Prentice Hall, Englewood Cliffs, NJ, 1996.

[36] D.H. Brandwood, “A complex gradient operator and its application in adaptive array
theory,” Proc. IEE, vol. 130, Pts. F and H, pp. 11-16, Feb. 1983.

[37] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed., McGraw- 
Hill, New York, 1991. 

[38] F.D. Neeser and J.L. Massey, “Proper Complex Random Processes with Applications to
Information Theory,” IEEE Trans. on Information Theory, vol. 39, no. 4, pp. 1293-1302,
July 1993.

[39] S. Amari, “Natural gradient works efficiently in learning,” Neural Computation, vol. 10, 
pp. 251-276, 1998. 

[40] J.D. Markel and A.H. Gray, Linear Prediction of Speech, Springer-Verlag, Berlin, 1976. 

[41] J.W. Brewer, “Kronecker Products and Matrix Calculus in System Theory,” IEEE Trans.
Circuits and Systems, vol. 25, no. 9, pp. 772-781, Sep. 1978.

[42] M.Z. Ikram and D.R. Morgan, “Exploring permutation inconsistency in blind separation of 
speech signals in a reverberant environment,” Proc. IEEE Int. Conf. on Acoustics, Speech, 
and Signal Processing (ICASSP), Istanbul, Turkey, June 2000, vol. 2, pp. 1041-1044. 

[43] H. Sawada, R. Mukai, S. Araki, and S. Makino, “Robust and precise method for solving 
the permutation problem of frequency-domain blind source separation,” Proc. IEEE Int. 
Symposium on Independent Component Analysis and Blind Signal Separation (ICA), Nara, 
Japan, Apr. 2003, pp. 505-510. 

[44] J.-S. Soo and K.K. Pang, “Multidelay block frequency domain adaptive filter,” IEEE Trans.
Acoust., Speech, Signal Processing, vol. ASSP-38, pp. 373-376, Feb. 1990.

[45] P.C.W. Sommen, P.J. Van Gerwen, H.J. Kotmans, and A.J.E.M. Janssen, “Convergence 
analysis of a frequency-domain adaptive filter with exponential power averaging and 
generalized window function,” IEEE Trans. Circuits and Systems, vol. 34, no. 7, pp. 788- 
798, July 1987. 

[46] J. Benesty, A. Gilloire, and Y. Grenier, “A frequency-domain stereophonic acoustic echo
canceller exploiting the coherence between the channels,” J. Acoust. Soc. Am., vol. 106,
pp. L30-L35, Sept. 1999.

[47] G. Enzner and P. Vary, “A soft-partitioned frequency-domain adaptive filter for acoustic 
echo cancellation,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing 
(ICASSP), Hong Kong, China, April 2003, vol. 5, pp. 393-396. 

[48] R.M.M. Derkx, G.P.M. Egelmeers, and P.C.W. Sommen, “New constraining method for 
partitioned block frequency-domain adaptive filters,” IEEE Trans. Signal Processing, vol. 
50, no. 9, pp. 2177-2186, Sept. 2002. 

[49] F.J. Harris, “On the use of windows for harmonic analysis with the discrete Fourier trans- 
form,” Proc. IEEE, vol. 66, pp. 51-83, Jan. 1978. 

[50] S. Sawada, R. Mukai, S. de la Kethulle de Ryhove, S. Araki, and S. Makino, “Spectral 
smoothing for frequency-domain blind source separation,” in Proc. Int. Workshop on 
Acoustic Echo and Noise Control (IWAENC), Kyoto, Japan, Sep. 2003, pp. 311-314. 

[51] R. Mukai, H. Sawada, S. Araki, and S. Makino, “Real-Time Blind Source Separation for 
Moving Speakers using Blockwise ICA and Residual Crosstalk Subtraction,” Proc. IEEE 
Int. Symposium on Independent Component Analysis and Blind Signal Separation (ICA), 
Nara, Japan, Apr. 2003, pp. 975-980. 







[52] J.S. Garofolo et al., “TIMIT acoustic-phonetic continuous speech corpus,” National Insti- 
tute of Standards and Technology, 1993. 




IV 

AUDIO CODING AND REALISTIC 
SOUND STAGE REPRODUCTION 




Chapter 11

AUDIO CODING 



Gerald Schuller
Fraunhofer AEMT 
schuller@emt.iis.fhg.de 



Abstract In this chapter, the principles of audio coding will be described, with emphasis 
on low delay audio coding. Audio coding is based on psycho-acoustic masking 
effects, as computed by psycho-acoustic models. To use the masking effects 
and to obtain a good compression ratio, filter banks are used. The principles of 
psycho-acoustics and of the design of filter banks are presented. Further a new 
low delay audio coding scheme based on prediction is shown. 

Keywords: Psycho-Acoustics, Filter Banks, Polyphase, Low Delay Audio Coding, Prediction



1. INTRODUCTION 

Communications applications today use speech coders and usually assume
close-talking microphones, only one talker at a time, and little background noise.
Examples are telephones and cell phones. Speech coders have the important
advantage of a sufficiently low encoding/decoding delay. To obtain
a smooth flow of conversation, it was shown that a round-trip delay of less than
90 ms is desirable [2]. This includes encoding/decoding twice (once in each
direction) and the transmission delay, which can be a few tens of milliseconds.
Hence the encoding/decoding delay should be clearly less than 45 ms.

New communications scenarios include teleconferencing with many talkers 
and distant microphones, wireless microphones, wireless speakers, digital feed- 
back channels for musicians, or musicians playing together remotely. These 
communications applications are not restricted to one talker at a time, nor to 
close talking microphones, and not even to speech, e.g. in the case of musicians. 

What are the requirements for these applications? First, the
encoding/decoding delay becomes even more restrictive. If musicians play together
remotely, the delay should not be bigger than the delay of the sound on a big stage.
The same holds if there is a mix of direct sound and the encoded/decoded signal, for
example in a feedback channel for musicians, in live events, or in combinations
of wired and wireless speakers. In these applications the encoding/decoding
delay should be less than about 6 ms.

The next requirement is for the sound quality of the decoded signal. For 
speech signals big changes in the signal are accepted as long as it is intelligible 
(telephone quality). For music such degradations are not accepted, as can be 
seen for example in AM Radio. There the transmission bandwidth is notably 
reduced, and hence the programming consists almost completely of speech, and 
only very little of music. Also when there is more than one talker at a time, if 
there is background noise or room reverberation, it helps the intelligibility to 
have a higher audio quality. 

What is the principle of speech and audio coders? Speech coders are mainly 
based on a model of speech production. Only a few assumptions about the 
receiver are used to construct speech coders. In fact, the receiver does not even 
need to be the human ear, it can be a machine, like a speech recognizer. This 
is applied for example for speech recognition over cell phones. The problem 
for our newer applications is obvious, the source model does not work for 
non-speech signals like music or background noise. 

2. PSYCHO-ACOUSTICS 

To overcome the limitations of a source model, audio coding uses knowledge, 
or a model, of the receiver, the human ear, to obtain data compression. Psycho- 
acoustics is a field that investigates the properties of human hearing. 

The effects that are mainly used in audio coding are masking across
frequency and across time. The cochlea of the inner ear performs a frequency
decomposition, analogous to the filter banks used in audio coding. This is the
underlying cause of many different masking effects found in psycho-acoustics.

First, simultaneous masking: if there is a signal at a certain frequency and
with a certain level, a signal at the same frequency but with a certain lower
level is not detected by the ear. It is “masked” by the stronger signal. The level
differences for simultaneous masking depend on the level and on whether the signals are
tones or noise-like signals. A simple estimate for the simultaneous masking,
which can be used for psycho-acoustic models, is:

$$O(z)/\mathrm{dB} = \alpha\,(14.5 + z) + (1 - \alpha)\,5.5,$$

where $z$ is the signal frequency in Bark (as defined below), and $\alpha$ is a tonality
measure, which is equal to 1 for pure tonal signals and equal to 0 for pure
noise-like signals [12, 13]. In general,

$$\alpha = \min(\mathrm{SFM}/\mathrm{SFM}_{\max},\, 1),$$

Figure 11.1 Masking threshold as superposition of individual tonal maskers and the threshold
in quiet (from [12]).

with $\mathrm{SFM}_{\max} = -60$ dB. The Spectral Flatness Measure SFM is [12, 13]

$$\mathrm{SFM} = 10 \log_{10}\left(\frac{\left(\prod_{k=0}^{M-1} V_k\right)^{1/M}}{\frac{1}{M}\sum_{k=0}^{M-1} V_k}\right),$$

i.e., the ratio of the geometric mean to the arithmetic mean of the $M$ power spectrum values $V_k$, expressed in dB.
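The tonality measure and the resulting masking offset can be computed in a few lines. The sketch below assumes strictly positive power spectrum values (a zero bin would make the geometric mean vanish), and its function names are hypothetical. It follows the formulas above and is not a complete psycho-acoustic model.

```python
import numpy as np

SFM_MAX_DB = -60.0  # SFM value that is treated as "fully tonal"

def spectral_flatness_db(power_spectrum):
    """10*log10 of geometric mean over arithmetic mean (always <= 0 dB)."""
    p = np.asarray(power_spectrum, dtype=float)
    geometric_mean = np.exp(np.mean(np.log(p)))
    arithmetic_mean = np.mean(p)
    return 10.0 * np.log10(geometric_mean / arithmetic_mean)

def tonality(power_spectrum):
    """alpha = min(SFM / SFM_max, 1): 0 for a flat spectrum, 1 for a peaky one."""
    return min(spectral_flatness_db(power_spectrum) / SFM_MAX_DB, 1.0)

def masking_offset_db(z_bark, alpha):
    """O(z) = alpha * (14.5 + z) + (1 - alpha) * 5.5, in dB."""
    return alpha * (14.5 + z_bark) + (1.0 - alpha) * 5.5

flat = np.ones(64)            # noise-like: all bins equal
peaky = np.full(64, 1e-12)    # tone-like: one dominant bin
peaky[10] = 1.0

offset_noise = masking_offset_db(10.0, tonality(flat))
offset_tone = masking_offset_db(10.0, tonality(peaky))
```

For the flat spectrum the offset reduces to the noise value of 5.5 dB, and for the single-peak spectrum it reaches the tonal value of 14.5 + z dB.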

A weaker signal is masked not only if it is at the same frequency, but also
at nearby frequencies. The level at which signals at nearby frequencies are
masked is given by the “spreading function,” which is a decaying function
towards higher frequency differences.

The increase from lower frequencies is about 27 dB per Bark. The decrease
towards higher frequencies is about $27 - 0.37 \max(L_M - 40, 0)$ dB per Bark
[14], where $L_M$ is the masker's sound pressure level.

Figure 11.1 shows spreading functions of tonal maskers and their superposi- 
tion together with the masking threshold in quiet, to result in the overall masking 
threshold for the signal. 

The Bark scale is a nonlinear scale that describes the nonlinear, almost
logarithmic processing in the ear. The Bark scale ($z$) can be approximated with the
following equation:

$$z = 13 \cdot \arctan(0.76 \cdot f) + 3.5 \cdot \arctan\left((f/7.5)^2\right)$$

for $z$ in Bark and $f$ in kHz.
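As a quick illustration, the approximation can be evaluated for a few frequencies. The function name `hz_to_bark` is hypothetical, and the input is taken in Hz and converted to kHz internally.

```python
import math

def hz_to_bark(freq_hz):
    """Bark value z for a frequency in Hz, using the arctan approximation."""
    f_khz = freq_hz / 1000.0
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

for freq in (100.0, 1000.0, 4000.0, 16000.0):
    print(f"{freq:8.0f} Hz -> {hz_to_bark(freq):5.2f} Bark")
```

A 1 kHz tone maps to roughly 8.5 Bark. The mapping grows almost linearly at low frequencies and flattens towards high frequencies, which reflects the nearly logarithmic behaviour mentioned above.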

Masking occurs not only across frequency, but also across time. If there is
a strong onset or a click at a certain time, a smaller click with a certain lower level
a short time after it is masked. The level of the masked signal depends on the
time, and is again a decaying function, towards later times. The ear needs a
“recovery time” from the stronger click or onset. This is the forward masking.
The masking even extends to just before the click. There are processing times
in the ear which result in a “backward masking.” This means signals just before
the click are masked. But the decay of the masking threshold before the click
is much faster (it practically disappears within a few milliseconds before the
click) than after it. This can be seen in Fig. 11.2.

Figure 11.2 Temporal masking threshold (from [15]).

The superposition or addition of different masking thresholds leads to some 
interesting effects. Consider the case where two maskers are close together in 
frequency. At the point where the two masking thresholds have equal intensity, 
they lead to an increase of the masking threshold of about 3 dB, as would be 
expected by the addition of their intensities. But if these maskers are further
apart (more than about a critical band), the increase in masking is larger; it can
be up to about 12 dB [15]. This is also sketched in Fig. 11.1.
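The 3 dB figure for two equally strong thresholds follows from adding intensities rather than dB values. A minimal sketch of that arithmetic, with a hypothetical helper name, is shown below. It covers only the plain intensity addition and does not model the larger increase of up to about 12 dB reported for maskers that are further apart.

```python
import math

def add_thresholds_db(levels_db):
    """Add masking thresholds given in dB by summing their intensities."""
    total_intensity = sum(10.0 ** (level / 10.0) for level in levels_db)
    return 10.0 * math.log10(total_intensity)

# Two thresholds of equal intensity at the crossing point.
increase = add_thresholds_db([40.0, 40.0]) - 40.0
print(f"increase for two equal thresholds: {increase:.2f} dB")
```

Doubling the intensity corresponds to 10·log10(2), about 3 dB, independent of the absolute level.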

3. FILTER BANKS 

Filter banks are an important part of most audio coders. Early audio coders
used simple Discrete Cosine Transforms of consecutive blocks of the audio signal
as a filter bank, to obtain a time/frequency description. But it was found that
simple block transforms are not sufficient for high quality audio coding. They
lead to so-called blocking artifacts. Figure 11.3 shows the block diagram of a
filter bank with critical sampling, which means the downsampling rate in each
subband is equal to the number of subbands $N$. This ensures that the filter
bank does not introduce any additional redundancy to the signal. The filter
bank also has the so-called perfect reconstruction property, which means the
subband signals can be used to perfectly reconstruct the original signal by the
synthesis filter bank, but with a system delay of $n_d$ samples.

Figure 11.3 An $N$-channel filter bank with critical downsampling, perfect reconstruction, and
a system delay of $n_d$ samples.

3.1 POLYPHASE FORMULATION 

To obtain a fast implementation, and also a mathematical description that
makes it easier to obtain perfect reconstruction, the polyphase description or
structure is used. The effect of downsampling and upsampling in the analysis
and synthesis filter bank, respectively, can be viewed as processing the signal
in blocks of length $N$. The input signal is represented by an $N$-dimensional
vector $\mathbf{x}(m)$ composed of sequences of the downsampled $x(n)$,

$$\mathbf{x}(m) := [x_0(m), \ldots, x_{N-1}(m)]$$

with

$$x_i(m) := x(mN + i),$$

and the vector of the $N$ outputs of the analysis filter bank is (see also Fig. 11.3)

$$\mathbf{y}(m) = [y_0(m), \ldots, y_{N-1}(m)].$$
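The block representation above amounts to reshaping the sample stream so that row $m$ holds the $N$ polyphase components $x_i(m) = x(mN + i)$. A minimal sketch follows, with a hypothetical function name and with trailing samples that do not fill a complete block simply dropped.

```python
import numpy as np

def to_polyphase_blocks(x, N):
    """Row m holds [x(mN), x(mN + 1), ..., x(mN + N - 1)]."""
    num_blocks = len(x) // N
    return np.asarray(x[: num_blocks * N]).reshape(num_blocks, N)

x = np.arange(12)                  # toy input: x(n) = n
blocks = to_polyphase_blocks(x, N=4)
# blocks[m, i] equals x[m * 4 + i]; column i is the i-th downsampled phase.
```

Each column of `blocks` is one phase of the input downsampled by $N$, which is the form in which the signal enters the polyphase matrix.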

The z-transform of $\mathbf{x}(m)$ is given by

$$\mathbf{X}(z) = [X_0(z), \ldots, X_{N-1}(z)],$$

and similarly for $\mathbf{y}(m)$. The polyphase matrix for the analysis filter bank is $\mathbf{P}_a(z)$
[16],

$$\mathbf{P}_a(z) = \begin{bmatrix} P_{0,0}(z) & P_{0,1}(z) & \cdots & P_{0,N-1}(z) \\ P_{1,0}(z) & P_{1,1}(z) & & \vdots \\ \vdots & & \ddots & \vdots \\ P_{N-1,0}(z) & \cdots & & P_{N-1,N-1}(z) \end{bmatrix},$$

which contains different phases of the downsampled impulse responses,

$$P_{n,k}(z) = \sum_{m=0}^{L-1} h_k(mN + N - 1 - n)\, z^{-m},$$

where $L$ is the filter length in blocks of $N$ samples. The polyphase matrix for
the synthesis filter bank is defined similarly as

$$[\mathbf{P}_s(z)]_{k,n} = \sum_{m=0}^{L-1} g_k(mN + n)\, z^{-m}.$$

The analysis filtering and downsampling and the synthesis filtering and
upsampling operation can then be written as

$$\mathbf{Y}(z) = \mathbf{X}(z) \cdot \mathbf{P}_a(z), \qquad \hat{\mathbf{X}}(z) = \mathbf{Y}(z) \cdot \mathbf{P}_s(z).$$

Note that the signal vectors are multiplied from the left side, which has the
advantage that it matches the signal flow in block diagrams (from left to right,
see also Fig. 11.4). This is useful when converting the product into a filter
structure. It can be seen that perfect reconstruction results if [16]

$$\mathbf{P}_a(z) \cdot \mathbf{P}_s(z) = z^{-d} \cdot \mathbf{S}^{n_t}(z), \tag{11.1}$$

where

$$\mathbf{S}(z) := \begin{bmatrix} 0 & 0 & \cdots & 0 & z \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & 0 \end{bmatrix}$$

is a “shift matrix.” The multiplication with $\mathbf{S}^{n_t}(z)$ means a shift of $n_t$ samples
in a block. Here it means the reduction of the delay of $d$ blocks of length $N$
by $n_t$ samples. Fig. 11.4 shows that for $\mathbf{P}_a(z) \cdot \mathbf{P}_s(z) = \mathbf{I}$ the system delay
of the filter bank is $N - 1$ samples. This is the blocking delay, which results
from forming input blocks of length $N$ before processing them. In general the
system delay is the blocking delay plus the delay introduced by the above
matrices,

$$n_d = N - 1 + d \cdot N - n_t,$$

where $n_t$ can be used for the “fine tuning” of the system delay. Because this
is a general formulation for filter banks with a downsampling rate of $N$, it also
shows that the minimum possible system delay for these filter banks is $N - 1$
samples. This is the system delay, e.g., if the filter bank is a block transform of
size $N \times N$.

3.2 MODULATED FILTER BANKS 

Filter banks in audio coding usually have a large number of subbands.
MPEG-2/4 AAC for instance has 1024 subbands, and each filter has a length of
2048 taps. For this reason it is important to have an efficient implementation,
which can be obtained by using so-called modulated filter banks. In modulated
filter banks, each filter is the product of a so-called window function (typically
a low-pass filter) and a modulating function, which shifts the pass band to the
center frequency of the subband. In audio coding this modulating function is
typically a cosine function.

Figure 11.4 Polyphase representation of an $N$-channel filter bank with critical downsampling.

Now consider impulse responses of cosine modulated filter banks of the form

$$h_k(n) = h(n) \cdot \cos\left(\frac{\pi}{N}(k + 0.5)(n + 0.5 + n_a)\right), \tag{11.2}$$

$$g_k(n) = \frac{2}{N}\, h'(n) \cdot \cos\left(\frac{\pi}{N}(k + 0.5)(n + 0.5 - N + n_s)\right), \tag{11.3}$$

where $k = 0, \ldots, N - 1$, $n = 0, \ldots, LN - 1$, and $h(n)$ and $h'(n)$ are the analysis
and synthesis baseband prototype filters or window functions. The factor $2/N$
and the additional shift of $N$ in $g_k$ are introduced to simplify the following
notation. The two parameters $n_a$ and $n_s$ are limited to $0 \le n_a, n_s < N$. This
particular type of modulating function was chosen because it leads to filters
with narrow passbands and high stopband attenuation.

A simple general design method for modulated filter banks is given in the 
following. The polyphase matrices are constructed as a product or a cascade of 
simpler matrices [17, 18]. 

The polyphase matrices $\mathbf{P}_a(z)$ and $\mathbf{P}_s(z)$ for cosine modulated filter banks
as in (11.2, 11.3) have an interesting inherent structure. They can be written as
a product of a sparse “filter matrix” $\mathbf{F}_a(z)$ or $\mathbf{F}_s(z)$, respectively, containing the
prototype polyphase components only on the main diagonal and anti-diagonal
(and zero otherwise), a transform matrix $\mathbf{T}$, which is a simple DCT type 4,

$$[\mathbf{T}]_{n,k} := \cos\left(\frac{\pi}{N}(k + 0.5)(n + 0.5)\right), \quad 0 \le n, k < N,$$

and the shift matrices $\mathbf{S}^{n_a}$ and $\mathbf{S}^{n_s}$, adjusting the system delay. The resulting
product is

$$\mathbf{P}_a(z) = \mathbf{S}^{n_a}(z) \cdot \mathbf{F}_a(z) \cdot \mathbf{T}, \tag{11.4}$$

$$\mathbf{P}_s(z) = \mathbf{T}^{-1} \cdot \mathbf{F}_s(z) \cdot \mathbf{S}^{n_s}(z). \tag{11.5}$$
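The transform matrix is easy to build and check numerically. The sketch below constructs the DCT type 4 matrix from its definition and verifies the standard property that it is symmetric and that its inverse is the same matrix scaled by $2/N$. The function name is hypothetical.

```python
import numpy as np

def dct4_matrix(N):
    """[T]_{n,k} = cos(pi/N * (k + 0.5) * (n + 0.5)) for 0 <= n, k < N."""
    n = np.arange(N).reshape(-1, 1)   # row index
    k = np.arange(N).reshape(1, -1)   # column index
    return np.cos(np.pi / N * (k + 0.5) * (n + 0.5))

N = 8
T = dct4_matrix(N)
T_inv = (2.0 / N) * T                 # inverse of the DCT type 4 matrix
print(np.allclose(T @ T_inv, np.eye(N)))
```

Because the inverse is just a scaled copy of $\mathbf{T}$, the same fast transform can serve both the analysis and the synthesis side.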

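With the help of the following matrices, the matrix $\mathbf{F}_a(z)$ or $\mathbf{F}_s(z)$ can be
written as a product of factors [19, 17]. The factors are built with the following
matrices

$$\mathbf{G}_i = \begin{bmatrix} & & & & & g_0^i \\ & & & & \iddots & \\ & & & g_{N/2-1}^i & & \\ & & 0 & & & \\ & \iddots & & & & \\ 0 & & & & & \end{bmatrix} \tag{11.6}$$

with zeroes in all unspecified entries, or

$$\mathbf{G}_i = \begin{bmatrix} & & & & & 0 \\ & & & & \iddots & \\ & & & 0 & & \\ & & g_{N/2-1}^i & & & \\ & \iddots & & & & \\ g_0^i & & & & & \end{bmatrix} \tag{11.7}$$

in a way that both types alternate, for instance the first type for the odd indices $i$,
and the second for the even indices. Observe that $\mathbf{G}_i^2 = \mathbf{0}$, meaning it is
nilpotent of order 2. With the help of these matrices, the matrix $\mathbf{F}_a(z)$ can be written
as the product of “zero-delay matrices” $(\mathbf{I} + \mathbf{G}_i z^{-1})$, “maximum-delay
matrices” $(\mathbf{I} z^{-1} + \mathbf{G}_i)$, and a diagonal coefficient matrix $\mathbf{D} = \mathrm{diag}(d_0, \ldots, d_{N-1})$,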

fi—\ j/— r 

Fa(z) = 01-8) 

2 = 0 j = 0 

Observe that these matrix factors can also be seen as so-called lifting steps, as
seen in Fig. 11.5.

To obtain the synthesis polyphase matrix $P_s(z)$ for perfect reconstruction,
the inverse is needed. The inverse of the factors is simply

$$(I + G_i z^{-1})^{-1} = (I - G_i z^{-1}),$$

$$(I z^{-1} + G_i)^{-1} \cdot z^{-2} = (I z^{-1} - G_i).$$

Figure 11.5 The flow graph corresponding to the factors $I + G_i z^{-1}$ and
$I - G_i z^{-1}$ shows the lifting-like structure.
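
The nilpotency and the simple inverse can be checked numerically. The sketch
below is only an illustration under stated assumptions: the coefficients are
placed on the upper half of the anti-diagonal (one possible reading of the first
matrix type), the delay $z^{-1}$ is replaced by an arbitrary number `c`, and all
names and values are hypothetical.

```python
def first_type_G(coeffs):
    """N x N matrix with the coefficients on the upper half of the anti-diagonal
    (assumed layout), zero everywhere else. N is twice the number of coefficients."""
    N = 2 * len(coeffs)
    G = [[0.0] * N for _ in range(N)]
    for row, g in enumerate(coeffs):
        G[row][N - 1 - row] = g
    return G

def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def identity_plus(G, c):
    """Return I + c * G, where c stands in for the delay z^{-1}."""
    N = len(G)
    return [[(1.0 if i == j else 0.0) + c * G[i][j] for j in range(N)]
            for i in range(N)]

G = first_type_G([0.3, -1.2])  # arbitrary example coefficients, N = 4
c = 0.7                        # arbitrary stand-in value for z^{-1}

G_squared = matmul(G, G)                                         # should be all zeros
round_trip = matmul(identity_plus(G, c), identity_plus(G, -c))   # should be I
```

Because `G` only feeds the lower half of the input vector into the upper half of
the output, applying `I + c * G` changes one half of the signal using the other
half, which is the lifting-like behavior shown in the flow graph.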



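In such a way, the synthesis filter matrix $F_s(z)$ for perfect reconstruction
becomes

$$F_s(z) = D^{-1} \cdot \prod_{j=\nu-1}^{0} (I - G_j z^{-1}) \cdot \prod_{i=\mu-1}^{0} (I z^{-1} - G_i). \tag{11.9}$$

With this synthesis filter matrix, the concatenation of analysis and synthesis
filter bank, or the product of analysis and synthesis polyphase matrices, is

$$P_a(z) \cdot P_s(z) = S^{n_a}(z) \cdot F_a(z) \cdot F_s(z) \cdot S^{n_s}(z) = z^{-2\mu} \cdot S^{n_a}(z) \cdot S^{n_s}(z),$$

which means we have perfect reconstruction with a system delay of
$2\mu N - n_a - n_s$ samples.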

A simple and widely used example for a cosine modulated filter bank is
the so-called MDCT (short for “modified discrete cosine transform”). It has
$n_a = n_s = N/2$, $\nu = \mu = 1$, a filter length $L = 2N$, and $h(n) = h'(n)$. The
resulting analysis filter matrix is

$$S^{N/2}(z) \cdot F_a(z) = \begin{bmatrix}
0 & & h(0) z^{-1} & h(N) & & 0 \\
 & \iddots & & & \ddots & \\
h(N/2-1) z^{-1} & & & & & h(3N/2-1) \\
h(N/2) z^{-1} & & & & & -h(3N/2) \\
 & \ddots & & & \iddots & \\
0 & & h(N-1) z^{-1} & -h(2N-1) & & 0
\end{bmatrix} \tag{11.10}$$



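This matrix includes the multiplication with the shift matrix, hence it has this
diamond-like shape. The synthesis filter matrix is

$$F_s(z) \cdot S^{N/2}(z) = \begin{bmatrix}
0 & & h'(N/2-1) & h'(N/2) & & 0 \\
 & \iddots & & & \ddots & \\
h'(0) & & & & & h'(N-1) \\
h'(N) z^{-1} & & & & & -h'(2N-1) z^{-1} \\
 & \ddots & & & \iddots & \\
0 & & h'(3N/2-1) z^{-1} & -h'(3N/2) z^{-1} & & 0
\end{bmatrix} \tag{11.11}$$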



We have perfect reconstruction if
$S^{N/2}(z) \cdot F_a(z) \cdot F_s(z) \cdot S^{N/2}(z) = I \cdot z^{-1}$. Matrix
algebra shows that this invertibility is obtained with

$$h'(n) = \frac{h(n)}{h(n)\,h(2N-1-n) + h(N+n)\,h(N-1-n)},$$

$$h'(N+n) = \frac{h(N+n)}{h(n)\,h(2N-1-n) + h(N+n)\,h(N-1-n)}.$$

It can be seen that the denominator is the determinant of coefficients of 2 x 2
subsets of matrix (11.10). If this determinant is equal to one, $h'(n) = h(n)$. A
simple window function which fulfills this condition is the sine window

$$h(n) = \sin\left(\frac{\pi}{2N}(n + 0.5)\right)$$

for $n = 0, \ldots, 2N-1$. This is already a commonly used window for audio
coding. Another window is the so-called Kaiser-Bessel Derived (KBD) window,
which is numerically optimized for a higher stopband attenuation, and
used for instance in the MPEG-AAC and Dolby AC-3 coders [20].
Figure 11.6 Frequency responses of the sine window (dotted), the KBD window (dashed), and
the low delay window (solid line).
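
The perfect reconstruction property of the MDCT with the sine window can be
checked with a direct, unoptimized implementation. The sketch below is an
illustration under assumptions: it uses the common textbook form of the MDCT
with the phase term N/2 + 0.5 and the synthesis scaling 2/N, evaluates the sums
directly instead of using a fast algorithm, and all function names are
hypothetical.

```python
import math

def sine_window(N):
    """h(n) = sin(pi / (2N) * (n + 0.5)) for n = 0, ..., 2N - 1."""
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def mdct(block, window):
    """Windowed MDCT of one block of 2N samples, giving N coefficients."""
    N = len(block) // 2
    return [sum(window[n] * block[n]
                * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(coeffs, window):
    """Windowed inverse MDCT, giving 2N time-aliased output samples."""
    N = len(coeffs)
    return [2.0 / N * window[n]
            * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                  for k in range(N))
            for n in range(2 * N)]

def analysis_synthesis(x, N):
    """Blocks of length 2N with hop size N, then overlap-add of the outputs."""
    w = sine_window(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * N + 1, N):
        out = imdct(mdct(x[start:start + 2 * N], w), w)
        for n in range(2 * N):
            y[start + n] += out[n]
    return y

N = 8  # illustrative number of subbands
x = [math.sin(0.37 * n) + 0.1 * n for n in range(5 * N)]
y = analysis_synthesis(x, N)
# Only samples covered by two overlapping blocks are reconstructed.
max_error = max(abs(a - b) for a, b in zip(x[N:4 * N], y[N:4 * N]))
```

The first and last N samples are covered by only one block, so their aliasing is
not cancelled; this is why the comparison is restricted to the middle part.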



Both windows lead to so-called orthogonal filter banks (the impulse responses
of analysis and synthesis are time reversed versions of each other),
with “unitary” polyphase matrices, i.e. $P_a^{-1}(z) = P_a^T(z^{-1})$. In this case the
system delay equals the length of the filters minus one, i.e. $LN - 1$. In the
general case, with the above described design method, it is possible to obtain
low delay filter banks with a lower delay than the length of the filters. This can
be used for instance to obtain improved frequency responses without the need
to increase the system delay. For example the frequency responses of the sine
and the KBD windows can be surpassed by using a filter bank with twice the
filter length but still the same delay. This can be seen in Fig. 11.6. Here the
frequency responses of three different window functions for 1024-band filter
banks are compared. The sine and KBD windows have lengths of 2048 taps and
the corresponding filter banks have a system delay of 2047 samples. The low
delay window has a length of 4096 taps, but a system delay of only the usual
2047 samples. Because of the greater length, it has a better far-off attenuation
than the KBD window and a nearby attenuation comparable to the sine window.
The corresponding window functions can be seen in Fig. 11.7.
Figure 11.7 The sine window (dotted), the KBD window (dashed), and the low delay window
(solid line).



3.3 BLOCK SWITCHING 

Most audio coders use block switching to avoid pre-echo artifacts. Block
switching means that the filter bank is switched to a different number of
subbands and a different filter length, all while processing the signal. This is done
while maintaining perfect reconstruction at all times. How it is done for the
sine window can be seen in Fig. 11.8. In (11.10) it can be seen that the first and
last $N/2$ elements of the window change direction; they are reversed in time. The
operation of multiplying the signal with (11.10) can also be seen as “time-domain
aliasing”: the first and last $N/2$ elements of the signal are “folded” onto
a sequence of length $N$. In order to invert this time-domain aliasing or folding
we need the inverse structure as in (11.11). When we switch to a shorter window,
only a shorter length of the time-domain aliasing can be reversed. Hence
we need a switching window which has a shorter length of the time-reversal
at the end, so that it can be inverted by the following short window. Usually
the switch-down window contains the first half of the original long window,
then a sequence of $N_l/2 - N_s/2$ ones (with $N_l$ the number of subbands for the
long window, and $N_s$ for the short window), followed by the last half of the
short window. In order to keep the same overall blocking structure, this
switch-down window is followed by a multiple of $N_l/N_s$ short windows, and then a
switch-up window is used. The latter is a time reversed version of the
switch-down window. The sequence of a switch-down window for 1024 bands, 8 short
windows for 128 bands, and a switch-up window can be seen in Fig. 11.8.
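
The construction of the switching windows can be sketched as follows for sine
windows. The trailing zeros that fill the window to the full length of $2N_l$
are an assumption (the text only lists the first three segments), and the
function names are hypothetical.

```python
import math

def sine_window(N):
    """Sine window of length 2N."""
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def switch_down_window(N_long, N_short):
    """First half of the long window, a run of ones, last half of the short
    window, and (as an assumption) zeros to reach the length 2 * N_long."""
    long_w = sine_window(N_long)
    short_w = sine_window(N_short)
    flat = N_long // 2 - N_short // 2      # number of ones, N_l/2 - N_s/2
    return (long_w[:N_long]                # rising half of the long window
            + [1.0] * flat                 # flat part
            + short_w[N_short:]            # falling half of the short window
            + [0.0] * flat)                # assumed zero padding

def switch_up_window(N_long, N_short):
    """Time reversed version of the switch-down window."""
    return switch_down_window(N_long, N_short)[::-1]

N_long, N_short = 1024, 128
down = switch_down_window(N_long, N_short)
up = switch_up_window(N_long, N_short)

# Mirrored samples of the second half have squared values that sum to one,
# which is the condition for cancelling the time-domain aliasing there.
worst = max(abs(down[N_long + n] ** 2 + down[2 * N_long - 1 - n] ** 2 - 1.0)
            for n in range(N_long))
```

With 1024 and 128 bands the flat part has 448 ones, and the shortened
time-reversal region at the end has exactly the length that the following short
window can invert.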

Low delay filter banks can be made time-varying by using time-varying
coefficients in the matrices (11.6), (11.7), to obtain $G_i(m)$, with the time index
$m$, and also a time-varying diagonal coefficient matrix $D(m)$. Observe a special
commutation rule for these time-varying systems. If the signal is multiplied
from the left, then [17]

$$G_i(m) \cdot z^{-1} = z^{-1} \cdot G_i(m-1). \tag{11.12}$$

This means a signal which first passes $G_i(m)$ and is then delayed by one block
is equal to a signal which is first delayed and then passes $G_i(m-1)$, with equally
delayed coefficients. Hence for the maximum delay matrices in (11.8), (11.9)
the following is true

$$(I z^{-1} + G_i(m)) \cdot (I z^{-1} - G_i(m-1)) = I z^{-2}.$$

Since the maximum delay matrices together with their inverses introduce a
delay, (11.12) also means that the following inverses have to have accordingly
delayed coefficients. In this manner also the number of bands can be switched.
A suitable number of coefficients has to be set to zero, and for the transform
matrix a smaller DCT is used, padded with zeros [17].

4. CURRENT AND BASIC CODER STRUCTURES 

Most audio coders consist of a few main building blocks (Fig. 11.9). The
filter bank, which typically has 1024 bands, and which can be switched to
128 bands, divides the audio signal into spectral components. In general more
subbands lead to higher coding gain, which means a lower bit-rate. But more
subbands also mean longer impulse responses of the filter bank, and this can lead
to pre-echo artifacts at transients, like attacks [1]. Hence it can be switched to a
lower number of subbands with shorter impulse responses to avoid pre-echoes
at transient parts of a signal. For the switching decision a “look ahead” of one
block is needed, which introduces a delay. The psycho-acoustic model is mostly
proprietary. The quantization step size in each subband is controlled by the
psycho-acoustic model. This operation introduces the quantization noise. The
quantization step size has to be transmitted to the decoder for the decoding
process. Hence this information has to be parameterized, in order to minimize
the required bit-rate of this side-information. This parameterization determines
the noise shape and hence should approximate the masking threshold as closely
as possible. The usual approach is to use a piece-wise constant function. This
is appropriate because the masking threshold is usually a smooth function over
frequency. A frequency section with the same quantizer step size is often called a
“scale factor band.” The AAC coder, for instance, has 49 scale factor bands, with
increasing numbers of subbands per scale factor band towards higher frequency,
roughly according to the Bark scale. After the quantization the signal is entropy
coded, usually with Huffman coding, with a set of different Huffman codebooks.
For each scale factor band the optimal codebook is chosen for the encoding,
and its index is transmitted to the decoder, also as side information. Especially
when switching to the shorter blocks this system leads to large fluctuations in
instantaneous bit-rate. To smooth the bit-rate demand of the coder, buffering
of the bit-stream is used.

Figure 11.8 Diagram of a sequence of switching windows. Left the switch-down window, then
8 short windows, then the switch-up window.
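
The piece-wise constant parameterization can be illustrated with a small sketch.
It assumes that the masking threshold is given as a noise power per subband and
that the step size of a band is chosen so that the noise power of a uniform
quantizer, step size squared divided by 12, stays at or below the smallest
threshold in that band. The band edges, the data, and this selection rule are
hypothetical and only serve as an example.

```python
import math

def band_step_sizes(masking_power, band_edges):
    """One quantizer step size per scale factor band (assumed rule:
    step^2 / 12 equals the smallest masking threshold power in the band)."""
    return [math.sqrt(12.0 * min(masking_power[lo:hi]))
            for lo, hi in zip(band_edges[:-1], band_edges[1:])]

def quantize(coeffs, steps, band_edges):
    """Uniform quantization with one step size per band, giving integer indices."""
    indices = []
    for step, lo, hi in zip(steps, band_edges[:-1], band_edges[1:]):
        indices.extend(round(c / step) for c in coeffs[lo:hi])
    return indices

def dequantize(indices, steps, band_edges):
    """Decoder side: rebuild the subband values from indices and step sizes."""
    values = []
    for step, lo, hi in zip(steps, band_edges[:-1], band_edges[1:]):
        values.extend(i * step for i in indices[lo:hi])
    return values

band_edges = [0, 2, 5, 8]  # hypothetical scale factor band boundaries
masking_power = [0.04, 0.05, 0.09, 0.1, 0.12, 0.3, 0.25, 0.4]
coeffs = [1.3, -0.7, 0.2, 2.1, -1.4, 0.05, 3.3, -2.2]

steps = band_step_sizes(masking_power, band_edges)
decoded = dequantize(quantize(coeffs, steps, band_edges), steps, band_edges)
```

Only the three step sizes have to be sent as side information here, instead of
one value per subband, at the price of a noise shape that follows the threshold
in steps.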

Examples are MPEG2/4 AAC, and PAC. AAC is an abbreviation for “Ad- 
vanced Audio Coding” and is the latest audio coder of the MPEG series of 
coders. PAC is for “Perceptual Audio Coding,” and is a proprietary audio coder 
with the same developmental roots as the AAC. 

A characteristic of the MPEG coders is that only the decoder and the
bit-stream syntax are standardized. The encoder is described only as an
“informative” part. This leaves room for improvements on the encoder side, and hence
different encoders can have different compression performances, for instance
depending on the quality of the psycho-acoustic model.

Figure 11.9 Principle of a subband based audio coder. AFB: analysis filter bank, SFB: synthesis
filter bank, Q: quantizers, EC: entropy coding, ED: entropy decoding.



The desire for a high compression ratio leads to a high number of subbands.
This in turn leads to long impulse responses and a high system delay of the
analysis/synthesis filter banks. Together with the look ahead delay for the
switching decision and the delay introduced by buffering the bit-stream, this
leads to an encoding/decoding delay which is too high for communications
purposes, usually from around 100 ms to a few hundred ms.

5. STEREO CODING 

The goal of stereophonic coding is to produce a spatial acoustic impression
during playback. To obtain this effect we can take a look at how the ear
determines the direction of a sound, that is, which psycho-acoustic effects it uses.
There are two main categories of cues or effects the ear uses. The first is the
interaural time differences (ITD). The sound from a sound source closer to one
ear reaches that ear first. From the time differences the brain can estimate the
horizontal direction of the sound source. The second category is the interaural
level differences (ILD). The sound from a sound source closer to one ear will
have a higher level at that ear, because the sound to the other ear has to travel
around the head.

If we look more closely at how the ear and the brain use those cues,
we find that for the ITD below about 800 Hz, the ear uses mostly waveform or
phase information. For frequencies above 800 Hz it uses the energy envelope
of the sound. The ILD are more important at higher frequencies. Since the
ear uses both categories for the estimation of the sound direction, they can be
interchanged within limits for coding purposes. Figure 11.10 shows how those
two effects can be used to make stereophonic recordings.

Figure 11.10 Variants of stereo recordings.

For coding a stereophonic signal the spatial distribution of the quantization
noise is important. The optimal masking of the quantization noise occurs when
it has the same apparent spatial position as the signal. If the quantization noise
and the signal have different spatial positions, unmasking of the quantization
noise can occur, even if it was masked in the mono case. This is quite a
remarkable effect. It means that if one takes two identical or very similar audio
signals, encodes them with the quantization noise just at the masking threshold,
but with different quantization noise (different phase), and plays those two
signals as left and right channel, then the quantization noise becomes audible,
even if it was not audible in each channel by itself! To avoid this unmasking of
the quantization noise, the masking threshold has to be lowered. The difference
between the original and the lowered masking threshold is called the “binaural
masking level difference” (BMLD), which is up to 18 dB.

This effect poses a problem for stereophonic coding. If the two channels,
left and right, are simply encoded separately by a monophonic audio coder,
and if the audio signal is centered, like a news speaker or an instrument in
the middle, then unmasking of the quantization noise can occur. The decoded
signal still appears in the center, whereas the uncorrelated quantization noise
appears to come from the sides. A further problem is that the redundancy
between the two channels is not used for compression of the signal. To solve
this problem for centered signals, the signal is first processed by “matrixing,”
or more specifically by computing the sum and difference of the two channels.
The two channels for the encoding are then the “mid” (M), which is the average
of the left (L) and right (R) channel, in effect the mono signal, and the “side”
(S), the difference of left and right,

$$M = (L + R)/2, \quad S = (L - R)/2.$$

The decoder inverts this operation after decoding the mid and side channel,

$$L = M + S, \quad R = M - S.$$

Observe that this operation is lossless in the absence of quantization. It can also
be seen as a principal axis transform, where the coordinate system is rotated
by 45 degrees, from a left/right to a mid/side coordinate system. If the audio
signal contains mostly centered sources, most of the energy will be in the mid
channel, and only very little in the side channel. The reduced energy in the side
channel reduces the required bit-rate for the stereo encoding. Also observe the
important property that now the quantization noise for the mid channel indeed
appears to come from the center after decoding, since the mid channel appears in
both the left and the right channel after decoding. This way we solved the problem
for centered signals. But what about signals from the sides? If we have an
audio signal predominantly from one side, the matrixing poses a problem. It
spreads the signal to the other channel, which means that unnecessary bits are
needed, and it also leads to a quantization noise spread between the left and
right channel. This means that for signals which are not centered it is advantageous
to switch back to separate left/right coding. This switching information has to
be transmitted to the decoder as side-information. Usually an audio signal
cannot clearly be categorized as centered or not. There are usually many
different sound sources at different spatial locations. Fortunately they usually
occupy different regions across frequency, at least as long as they are clearly
distinguishable by the ear. This leads to a more efficient switching algorithm
for the mid/side decision. The signal is divided into subbands, for instance
roughly according to the Bark scale. The decision for mid/side or left/right
coding is then made in each of these subbands independently. This increases the
necessary side information, but overall leads to a more efficient stereo coding.
This is the most widely used stereo coding strategy. It is used for instance in
MPEG-AAC, AC-3, and PAC.
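
The matrixing and the per-band switching can be sketched as follows. The
decision rule (mid/side is chosen when the side energy is below a fixed fraction
of the mid energy), the threshold value, the band edges, and the sample values
are all hypothetical; real encoders use their own, typically perceptually
motivated, criteria.

```python
def ms_encode(left, right):
    """M = (L + R) / 2, S = (L - R) / 2, applied sample by sample."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

def energy(values):
    return sum(v * v for v in values)

def choose_modes(left, right, band_edges, ratio=0.1):
    """Per band: 'MS' if the side energy is small compared to the mid energy
    (assumed rule with a hypothetical threshold), otherwise 'LR'."""
    modes = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        mid, side = ms_encode(left[lo:hi], right[lo:hi])
        modes.append("MS" if energy(side) < ratio * energy(mid) else "LR")
    return modes

# Band 0 holds a nearly centered source, band 1 a source on the left only.
left = [1.0, 0.9, 0.5, 0.0]
right = [1.0, 0.8, 0.0, 0.0]
band_edges = [0, 2, 4]

modes = choose_modes(left, right, band_edges)
mid, side = ms_encode(left, right)
left_rec, right_rec = ms_decode(mid, side)
```

For the one-sided band the mid and side channels carry the same energy, which is
the spreading effect described above, so the sketch falls back to left/right
coding there. The list `modes` corresponds to the switching side information.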

Mid/side coding can be used to obtain high quality stereo coding, but it also
leads to some overhead in bit-rate. For lower bit-rates it can be worthwhile to
reduce this overhead at the expense of the precision of the spatial rendering of
the audio signal. This goal can be achieved with so-called intensity stereo.
Behind intensity stereo is the assumption that we have a predominantly centered
audio signal. It is a lossy coding, which means we lose precision for the
spatial position of the sound sources, and it uses only the ILD to obtain the
spatial effects. Usually it is used above a certain cut-off frequency, for instance
4 kHz. In principle it works like mid/side coding, but instead of the side channel
there is only side-information about the amplitude in each channel. The mid
channel becomes a so-called coupling channel. It is the mono version of the
signal. Compared to the mid channel it contains a phase adjustment, to avoid
cancellations of signals by the addition. The power in the coupling channel
and in the left and right channel is measured. The relation of the left and right
power to the power in the coupling channel is then transmitted as scaling side
information to the decoder, together with the coupling channel. This process is
done in subbands, just like for mid/side coding. This can be seen in Fig. 11.11
[20]. The decoder then plays the coupling channel, but scaled for the left and
right side according to the scaling information in the subbands, as in Fig. 11.12.
Intensity stereo is used at the lower bit-rates in MPEG-AAC, AC-3, PAC, and
MPEG-1/2 Layer 3 (MP3).

Figure 11.11 Diagram of an intensity stereo encoder.

Figure 11.12 Diagram of an intensity stereo decoder.
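
A minimal sketch of the intensity stereo principle is given below. It leaves out
the phase adjustment of the coupling channel and simply uses the average of left
and right, which is an assumption made to keep the example short. The band
edges, sample values, and function names are hypothetical.

```python
import math

def power(values):
    return sum(v * v for v in values)

def intensity_encode(left, right, band_edges):
    """Return the coupling channel and, per band, the scale factors for the
    left and right channel (square root of the power ratio)."""
    coupling = [(l + r) / 2.0 for l, r in zip(left, right)]  # no phase adjustment
    scales = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        p_c = power(coupling[lo:hi])
        if p_c > 0.0:
            scales.append((math.sqrt(power(left[lo:hi]) / p_c),
                           math.sqrt(power(right[lo:hi]) / p_c)))
        else:
            scales.append((0.0, 0.0))
    return coupling, scales

def intensity_decode(coupling, scales, band_edges):
    """Play the coupling channel, scaled per band for the left and right side."""
    left, right = [], []
    for (s_l, s_r), lo, hi in zip(scales, band_edges[:-1], band_edges[1:]):
        left.extend(s_l * c for c in coupling[lo:hi])
        right.extend(s_r * c for c in coupling[lo:hi])
    return left, right

left = [1.0, 0.5, 0.2, 0.1]
right = [0.5, 0.25, 0.0, 0.3]
band_edges = [0, 2, 4]

coupling, scales = intensity_encode(left, right, band_edges)
left_dec, right_dec = intensity_decode(coupling, scales, band_edges)
```

The decoded channels have the same power per band as the originals, so the level
differences survive, while the waveforms of both decoded channels are scaled
copies of one signal and the time differences are lost.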

6. LOW DELAY AUDIO CODING 

In full-duplex applications like teleconferencing, echoes bouncing back from
the far-end can occur, and unnatural delays in response times in the interaction
between parties are to be avoided. The latter may be a problem in a conversation
application if the round-trip time, RTT, from the near-end to the far-end and back
exceeds 90 ms [2]. Reflections are not a problem if the acoustic round-trip path
can be eliminated, e.g., by using a combination of headphones and close-talk
microphones. Otherwise, it is necessary to use acoustic echo cancellation, AEC,
to attenuate return-path reflections. A distinct reflection arriving at the listener's
ears 40-50 ms after the direct sound is called echo. At low RTTs, the perceived
effect is a colorization of the signal.

Figure 11.13 Required attenuation in a round trip loop.

Figure 11.13 shows the required attenuation for a single reflection in three dif- 
ferent experiments. The dashed curve is from [21] and represents the round-trip 
attenuation at acceptable level for telephone speech. The dashed-dotted curve 
shows the corresponding ITU-T G.131 recommendation. The solid curve in
Fig. 11.13 shows measurement data averaged over widely used high-quality 
audio test material including Castanets, Suzanne Vega, female and male speak- 
ers, and flute at the sampling rate of 32 kHz [22]. 

The results are qualitatively in line with earlier results but, as expected, show 
significantly higher requirements for attenuation especially at high delays. The 
dip for the very low delays can be explained by the fact that very low delays 
(which are not realistic in most applications) result in a comb-filter like effect, 
which becomes noticeable. 

The data suggest that a reduction of 10 ms in the round-trip delay corresponds
to a 3-4 dB drop in the requirements for echo cancellation. For example, if the
algorithmic coding delay is diminished from 20 to 6 ms, the requirements for
AEC are down by more than 10 dB. The required high attenuation for echoes
(RTT > 50 ms) is very difficult to achieve. Therefore, a 25 ms one-way delay
is a realistic upper margin for echo-free audio communications. The speed of
light in an optical fiber is approximately 200 km/ms. Hence, one may roughly
estimate that a 1 ms decrease in algorithmic coding delay corresponds to a
100 km increase in the range of echo-free communications. This all suggests
that the coding delay should be below about 10 ms, which is also close to the
recommendations for low-delay speech coding [3].

To obtain an audio coder for communications applications, the AAC low delay
coder, for instance, was developed. It has the same basic structure as the AAC
coder. To obtain a lower delay it features shorter filters and fewer subbands. It
has 480 subbands with filters of length 960 taps, and has no switching to a lower
number of subbands, to avoid the look ahead for the switching decision. Instead
it has a window shape switching to a window with the same number of subbands
but less overlap and shorter length, and temporal noise shaping [23] to reduce
pre-echo artifacts [24]. Avoiding the switching also leads to fewer fluctuations
in the bit-rate demand, which leads to a reduced buffer size and a lower delay.
Without the buffering, it has a delay of 960 samples, or 20 ms at 48 kHz sampling
rate (accordingly scaled at other sampling rates). But its reduction of the number
of subbands also leads to a reduced compression performance compared to the
AAC coder. Its compression performance is roughly comparable to MP3, the
predecessor of the AAC coder. Its operating range is usually between 48 and
96 kb/s.

The ITU-T G.722.1 coder is also a subband coder, and it is more specialized
towards speech coding. It has 320 subbands, operates at 16 kHz sampling rate
and at bit-rates between 24 and 32 kb/s. It has a delay of 640 samples or 40 ms
at 16 kHz sampling rate.
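
The delay figures quoted for these coders follow directly from the delay in
samples and the sampling rate. The helper below is a hypothetical convenience
function that reproduces the two values given above.

```python
def delay_ms(delay_samples, sampling_rate_hz):
    """Convert an algorithmic delay in samples to milliseconds."""
    return 1000.0 * delay_samples / sampling_rate_hz

print(delay_ms(960, 48000))  # AAC low delay coder: 20.0
print(delay_ms(640, 16000))  # G.722.1: 40.0
```

Both values are well above the target of about 10 ms derived earlier from the
echo requirements.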

There is a trade-off between the delay and the compression performance 
with subband coding. To avoid this trade-off or connection, a different coding 
principle is needed. Predictive coding has the same asymptotic compression 
performance (for stationary signals and infinitely many subbands or infinitely 
long predictors) [25], but it has the advantage that a predictor does not introduce 
a delay, no matter how long it is. This is a principle which has long been used 
in speech coding, as in ADPCM coders like G.726 or G.727. Also the speech 
coders used in cellular phones are based on predictive coding. The problem 
for its application in audio coding is the application of the psycho-acoustic
model. It provides the quantization step size for each frequency band, and is
hence directly suited for subband coding. To obtain a comparable noise shaping
with predictive coding, a suitable psycho-acoustically controlled pre- and
post-filtering can be used [4, 7]. In the first stage of the encoder the pre-filter is
used. It is controlled by the psycho-acoustic model and its computed masking
threshold $M(f)$, such that the frequency response $H(f)$ of the pre-filter is
inverse to the masking threshold,

$$H(f) = \frac{1}{|M(f)|}. \tag{11.13}$$



The filtering of the input audio signal can then be seen as normalizing the 
signal to its masking threshold. This has the consequence that distortions below 
unity in magnitude are inaudible. Hence a simple uniform stepsize quantizer is 
used to quantize the signal. The operation of pre-filtering and quantizing can 
be seen as irrelevance reduction. All inaudible components are removed. Since 
further distortions would be audible, this stage needs to be followed by a lossless 
coding stage, to obtain the final compression and bit-stream. The decoder has a 
lossless decoder as a first stage, followed by the post-filter. This post-filter has a 
frequency response which is inverse to the pre-filter. Its frequency response cor- 
responds to the masking threshold as computed by the psycho-acoustic model in 
the encoder. Hence side-information is needed to transmit the filter coefficients 
of the pre-filter to the decoder. This can be done efficiently using so-called line 
spectral frequencies, as known from speech coding [26]. The psycho-acoustic 
model in the encoder is still based on a subband decomposition with a filter 
bank. But since this filter bank is not used for the redundancy reduction, it 
can have fewer subbands and shorter filters than in conventional audio coders. 
With 128 subbands for the psycho-acoustic model, the first stage in the encoder 
introduces a delay of about 128 samples. A block diagram of the system can 
be seen in Fig. 11.14. 

The pre-filter is constructed like a predictor, to make it easily invertible. With
a direct form FIR implementation and the order of the pre-filter of $K$, its output
$y(n)$ is related to its input $x(n)$ through

$$y(n) = x(n) - \sum_{k=1}^{K} a_k\, x(n-k). \tag{11.14}$$

The inverse DFT of $|M(f)|^2$ gives the auto-correlation function $r_{mm}(n)$.
Then the filter coefficients $a_k$ are obtained by solving the linear equation system
[5]

$$\sum_{k=0}^{K-1} r_{mm}(|k - n|)\, a_{k+1} = r_{mm}(n + 1), \quad 0 \le n < K. \tag{11.15}$$
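
The computation of the pre-filter coefficients and the inversion by the
post-filter can be sketched as follows. The example masking threshold is
hypothetical, the equation system is solved with the Levinson-Durbin recursion
(one standard method for such Toeplitz systems, chosen here for brevity and not
prescribed by the text), and all function names are assumptions.

```python
import math

def autocorrelation(power_spectrum):
    """Inverse DFT of |M(f)|^2, given on a full symmetric DFT grid."""
    n_fft = len(power_spectrum)
    return [sum(p * math.cos(2.0 * math.pi * k * n / n_fft)
                for k, p in enumerate(power_spectrum)) / n_fft
            for n in range(n_fft)]

def levinson_durbin(r, K):
    """Solve the Toeplitz system for a_1, ..., a_K (returned as a[0..K-1])."""
    a = []
    err = r[0]
    for m in range(K):
        acc = r[m + 1] - sum(a[k] * r[m - k] for k in range(m))
        refl = acc / err
        a = [a[k] - refl * a[m - 1 - k] for k in range(m)] + [refl]
        err *= 1.0 - refl * refl
    return a

def pre_filter(x, a):
    """y(n) = x(n) - sum_k a_k x(n - k), with zero initial state."""
    return [x[n] - sum(a[k - 1] * x[n - k]
                       for k in range(1, len(a) + 1) if n - k >= 0)
            for n in range(len(x))]

def post_filter(y, a):
    """Inverse of the pre-filter: x(n) = y(n) + sum_k a_k x(n - k)."""
    x = []
    for n in range(len(y)):
        x.append(y[n] + sum(a[k - 1] * x[n - k]
                            for k in range(1, len(a) + 1) if n - k >= 0))
    return x

n_fft = 16
# Hypothetical smooth masking threshold power |M(f)|^2 on the DFT grid.
power = [1.0 + 0.5 * math.cos(2.0 * math.pi * k / n_fft) for k in range(n_fft)]
r = autocorrelation(power)
a = levinson_durbin(r, K=2)

x = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5]
restored = post_filter(pre_filter(x, a), a)
```

Because the post-filter uses the same coefficients in a recursive structure, it
undoes the pre-filter exactly when no quantization takes place in between, which
is the reason for constructing the pre-filter like a predictor.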

The coefficients describe the masking threshold in a parametric form, just
like the scale factors in conventional audio coders. Only here it is in effect
an approximation by a polynomial (the frequency response of the post-filter).
Compared to the conventional piece-wise constant form it has the advantage that
it avoids unnatural steps in the approximation; it is inherently a smooth function
and hence allows a closer approximation of the masking threshold. It
was found that a 12th order filter is sufficient for this approximation. To further
improve this approximation, frequency warped filters can be used in order to
better follow the details of the masking threshold on the Bark scale [6].

Figure 11.14 The low delay audio coding scheme using psycho-acoustic pre- and post-filters
and predictive lossless compression.

The pre-filter and quantizer are followed by the lossless coder. Current
lossless audio coders are typically based on block-wise forward prediction. The
prediction coefficients for a block are transmitted as overhead, and the residuals
are Huffman coded and transmitted. This means there is a delay of at least one
block size. To obtain a low delay, backward adaptive predictive coding using
the LMS algorithm is used [27], which is also a standard technique in low-delay
speech coding [3]. The LMS algorithm is also widely used in on-line automatic
control, or acoustic echo cancellation (cf. [5]).

Let $x(n)$ be the signal at time $n$, and let $\mathbf{x}^T(n)$ be defined as
$\mathbf{x}^T(n) := [x(n - L + 1), \ldots, x(n)]$, where $L$ is the order of the
prediction. An $L$th-order predictor is of the form

$$P(\mathbf{x}(n-1)) = \mathbf{x}^T(n-1) \cdot \mathbf{h}(n), \tag{11.16}$$

where $\mathbf{h}(n)$ is the $L$-dimensional vector of predictor coefficients at time $n$.
$\mathbf{h}(n)$ is updated with the normalized LMS:

$$\mathbf{h}(n+1) = \mathbf{h}(n) + \frac{e(n)}{1 + \lambda \|\mathbf{x}(n-1)\|^2}\, \mathbf{x}(n-1), \tag{11.17}$$

with $e(n)$ being the prediction error. This is a special case of the normalized
LMS [5], i.e. with only one tuning parameter $\lambda$ to trade off adaptation speed
and accuracy.
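
A backward adaptive predictor of this kind can be sketched in a few lines. The
update used below, with the error divided by one plus $\lambda$ times the energy
of the past samples, is an assumption about the exact form of the normalized LMS;
the test signal, the order, and the value of `lam` are hypothetical.

```python
import math

def nlms_prediction_errors(signal, order, lam):
    """Predict each sample from the previous `order` samples and adapt the
    coefficients after every sample. Returns the prediction errors e(n)."""
    h = [0.0] * order        # predictor coefficients h(n)
    past = [0.0] * order     # [x(n - L), ..., x(n - 1)]
    errors = []
    for sample in signal:
        prediction = sum(hi * xi for hi, xi in zip(h, past))
        e = sample - prediction
        errors.append(e)
        # Normalized update (assumed form): step shrinks for energetic input.
        norm = 1.0 + lam * sum(xi * xi for xi in past)
        h = [hi + e * xi / norm for hi, xi in zip(h, past)]
        past = past[1:] + [sample]
    return errors

# A sinusoid is perfectly predictable from two past samples, so the error of a
# second order predictor should decay as the coefficients adapt.
signal = [math.sin(0.3 * n) for n in range(3000)]
errors = nlms_prediction_errors(signal, order=2, lam=1.0)
early = sum(e * e for e in errors[:100])
late = sum(e * e for e in errors[-100:])
```

Since the update only uses past samples and the current error, a decoder that
has reconstructed the same past samples can run the identical adaptation, and no
coefficients need to be transmitted.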

To obtain a high compression ratio, cascading and soft switching between
the cascaded predictors can be used [8]. The filter bank in conventional audio
coding has 2 modes, one with a high number of bands (typically 1024) but
reduced time resolution, and one with a lower number of bands (typically 128)
but a higher time resolution. The mode depends on the signal and is hard
switched (either one or the other). The analog of a filter bank with many
bands in subband coding is a predictor with high order in predictive coding. In
predictive coding, having more modes is simpler than in subband coding. This
makes it possible to have 3 instead of 2 modes, and to obtain soft switching
instead of hard switching. This is useful because it increases the prediction
accuracy and also avoids high bit-rate peaks, as follows. Speech/audio signals
have varied orders of correlations. Very non-stationary signals like sounds
from castanets need a short predictor that is able to track the signal fast enough,
whereas more stationary signals such as sounds from flutes require higher prediction
orders to accurately model the signal with all its spectral details.

Figure 11.15 The WCLMS lossless encoder. Q denotes rounding.

The LMS prediction is applied three times in a cascade (Weighted Cascaded
LMS, or WCLMS), leading to the predictors $P_1$, $P_2$ and $P_3$. Since the residuals
$e_1(n)$ of the first predictor are not integers but floating point numbers, they
cannot be reproduced and stored in finite precision without losing accuracy.
This is not a problem for a single LMS since its inputs $x(n)$ are integers (PCM
signals). However, in the second and third stages of the cascaded LMS, the
non-integer residuals are the inputs, in order to improve the accuracy of their
prediction. But when the encoding and decoding sides have different rounding
precisions, we are not able to synchronize the two sides, and the encoder and
decoder will produce different outputs. This problem is solved by limiting the
precision of the residuals in a defined manner, e.g. using 8 bit precision after the
fractional point. A diagram of a WCLMS encoder can be seen in Fig. 11.15, and the
decoder in Fig. 11.16.

Figure 11.16 The WCLMS lossless decoder. Q denotes rounding.
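
Limiting the precision of the residuals in a defined manner can be as simple as
rounding to a fixed number of fractional bits. The helper below is a
hypothetical illustration with 8 fractional bits.

```python
def limit_precision(value, fractional_bits=8):
    """Round a residual to a fixed number of bits after the fractional point."""
    scale = 1 << fractional_bits   # 256 for 8 fractional bits
    return round(value * scale) / scale

print(limit_precision(0.1234567))  # 0.125, i.e. 32 / 256
```

Since both sides apply the same rule to the same values, encoder and decoder
feed identical residuals into the following predictor stages.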

When using the cascade of predictors, one of the main issues is how to select
or combine these predictors. Bayesian statistics uses weighted combinations
for an improved prediction performance (cf. [28]). Using this approach the
model-based predictors $P_i$ can be combined into

$$\sum_i w_i P_i, \quad w_i \ge 0, \quad \sum_i w_i = 1, \tag{11.18}$$

where $w_i$ is the posterior (i.e. based on the observed data) probability that
$P_i$ is “correct” given data to date, which can be viewed as a measure of the
goodness-of-fit of the model or predictor $P_i$.

For most audio signals the assumption of a Laplacian probability distribution
yields a good fit to estimate the weights $w_i$. With a “forgetting factor” $\mu$ and
the joint probability density function, this leads to the estimate of the weight
$w_i$:

tni(n) (11.19) 
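
One plausible way to turn these ideas into numbers is sketched below. It is an
illustration only and does not claim to reproduce (11.19): each predictor is
scored by an exponentially forgotten sum of its past absolute errors, as a
Laplacian model suggests, and the scores are normalized so that the weights are
non-negative and sum to one. The constant `c`, the default values, and all names
are assumptions.

```python
import math

def combination_weights(error_histories, mu=0.9, c=2.0):
    """Weights for combining predictors from their past prediction errors
    (oldest error first). Assumed form: w_i proportional to
    exp(-c * (1 - mu) * sum_j mu^j * |e_i(n - 1 - j)|)."""
    scores = []
    for errors in error_histories:
        acc = 0.0
        for e in errors:
            acc = mu * acc + abs(e)    # older errors are gradually forgotten
        scores.append(-c * (1.0 - mu) * acc)
    top = max(scores)                  # subtract the maximum for stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def combine(predictions, weights):
    """Weighted combination of the individual predictions as in (11.18)."""
    return sum(w * p for w, p in zip(weights, predictions))

# Hypothetical error histories of three cascaded predictors.
histories = [[4.0, 3.5, 5.0], [1.0, 0.5, 0.8], [2.0, 2.5, 1.5]]
weights = combination_weights(histories)
combined = combine([10.0, 12.0, 11.0], weights)
```

The predictor with the smallest recent errors receives the largest weight, and
the weights change gradually with the signal, which gives the soft switching
between the predictors described above.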

The final stage is an entropy coder for the prediction error. Again,
block-based approaches like block-based Huffman coding cannot be used because
of their inherent delay. But (backward) adaptive schemes can be used, such as
adaptive Huffman coding or arithmetic coding.

The resulting audio coder has a delay which is mainly determined by the
psycho-acoustic model in the encoder, with its block size of 128 samples. Taking
into account some delay for the entropy coding, the overall delay is about
200 samples, or 6 ms at 32 kHz sampling rate. This is sufficient for the most
delay critical applications. At the same time it has a compression ratio which
is comparable to conventional audio coders.

7. CONCLUSIONS 

Irrelevance reduction is necessary for audio coding, because we cannot rely
on a source model, as in speech coding. For that reason the basics of
psycho-acoustic models were described. These models work mainly in the frequency
domain. This makes the use of subband coding convenient for audio compression.
Communications applications need a very low end-to-end delay of only
a few milliseconds. But subband coding has an inherent trade-off between
compression performance and the resulting end-to-end delay. This would lead
to a low compression performance for a very low delay. To avoid this trade-off,
a different coding principle is useful. It is predictive coding, which does
not have this inherent trade-off, and which is already widely used in speech
coding. The problem is the application of psycho-acoustic models, which work
in the frequency domain. It was seen that this problem can be solved by using
psycho-acoustically controlled pre- and post-filters, and that audio
coders can indeed be designed using predictive coding with this method. This type
of audio coder has a delay of only a few milliseconds, and hence can be used
even in very delay critical communications applications.
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Abstract Conventional multichannel audio reproduction systems for entertainment or com- 
munication are not capable of immersing a large number of listeners in a well 
defined sound field. A novel technique for this purpose is the so-called wave 
field synthesis. It is based on the principles of wave physics and suitable for 
an implementation with current multichannel audio hard- and software compo- 
nents. A multiple number of fixed or moving sound sources from a real or virtual 
acoustic scene is reproduced in a listening area of arbitrary size. The listeners are 
not restricted in number, position, or activity and are not required to wear head- 
phones. A successful implementation of wave field synthesis systems requires 
to address also spatial aliasing and the compensation of non-ideal properties of 
loudspeakers and of listening rooms. 

Keywords: Wave Field Synthesis, Multichannel Audio, Loudspeaker Array, Loudspeaker
Compensation, Room Compensation

1. INTRODUCTION

State-of-the-art systems for the reproduction of spatial audio suffer from a
serious problem: the spatial properties of the reproduced sound can only be
perceived correctly in a small part of the listening area, the so-called sweet
spot. This restriction occurs because conventional reproduction of spatial audio
is based on psychoacoustics, i.e. mainly intensity panning techniques. A
solution to this problem calls for a new reproduction technique which allows 
the synthesis of physically correct wave fields of three-dimensional acoustic 
scenes. It should result in a large listening area which is not restricted to a 
particular sweet spot. 

A number of different approaches have been suggested. They can be roughly 
categorized into advanced panning techniques, ambisonics, and wave field syn- 
thesis. Advanced panning techniques aim at enlarging the sweet spot well 
known from two-channel stereo or 5-channel surround sound systems. An ex- 
ample is the vector base amplitude panning technique (VBAP) as described e.g. 
in [1]. Ambisonics systems represent the sound field in an enclosure by an ex- 
pansion into three-dimensional basis functions. A faithful reproduction of this 
sound field requires recording techniques for the contributions of all relevant 
basis functions. In common realizations, only the lowest order contributions 
are exploited [2]. 

This chapter presents the third approach from above, wave field synthesis 
(WFS). It is a technique for reproducing the acoustics of large recording rooms 
in smaller sized listening rooms. The spatial properties of the acoustical scene
can be perceived correctly by an arbitrarily large number of listeners who
are allowed to move freely inside the listening area without the need for their
positions to be tracked. These features are achieved through a strict foundation
on the basic laws of acoustical wave propagation. 

WFS typically uses more loudspeakers than audio channels. Contrary to 
conventional spatial audio reproduction techniques like two-channel stereo or 
5-channel surround, there exists no fixed mapping of the audio channels to the 
reproduction loudspeakers. Instead, a representation of the three-dimensional 
acoustic scene is used that can be reproduced by various WFS systems utilizing 
different loudspeaker setups. 

The theory of WFS was initially developed at the Technical University
of Delft over the past decade [3, 4, 5, 6, 7] and has been further investigated
in [8, 9, 10, 11, 12]. In [13] it has been shown how WFS is related to the 
Ambisonics approach. 

Section 2 covers the physical foundations and the basic rendering technique.
Two methods for sound field rendering are covered in detail in Section 3. The
practical implementation requires a thorough analysis of acoustical wave fields,
which is shown in Section 4. It is the basis of methods for compensating the
influence of non-ideal loudspeakers and of the listening room as described in
Section 5. Finally, the implementation of wave field synthesis systems is shown
in Section 6.

Figure 12.1 Basic principle of wave field synthesis. (a) Synthesis of a wave
front from spherical waves according to Huygens' principle. (b) Synthesis of a
wave front by a loudspeaker array with appropriately weighted and delayed
driving signals.

2. RENDERING OF SOUND FIELDS WITH WAVE
FIELD SYNTHESIS

This section gives an overview of the physical foundations of wave field 
synthesis and its practical realization. 

2.1 PHYSICAL FOUNDATION OF WAVE FIELD
SYNTHESIS

The theoretical basis of WFS is given by Huygens' principle. Huygens
stated that any point of a propagating wave at any instant conforms to the 
envelope of spherical waves emanating from every point on the wavefront at 
the prior instant. This principle can be used to synthesize acoustic wave fronts 
of arbitrary shape. Of course, it is not very practical to position the acoustic 
sources on the wave fronts for synthesis. By placing the loudspeakers on an 
arbitrary fixed curve and by weighting and delaying the driving signals, an 
acoustic wave front can be synthesized with a loudspeaker array. Figure 12.1 
illustrates this principle. 

The mathematical foundation of this more illustrative description of WFS
is given by the Kirchhoff-Helmholtz integral (12.1), which can be derived by
combining the acoustic wave equation and Green's integral theorem [14]:

$$
P(\mathbf{r},\omega) = \frac{1}{4\pi} \oint_S \left[ P(\mathbf{r}_S,\omega)\, \frac{\partial}{\partial \mathbf{n}} \left( \frac{e^{-j\beta|\mathbf{r}-\mathbf{r}_S|}}{|\mathbf{r}-\mathbf{r}_S|} \right) - \frac{\partial P(\mathbf{r}_S,\omega)}{\partial \mathbf{n}}\, \frac{e^{-j\beta|\mathbf{r}-\mathbf{r}_S|}}{|\mathbf{r}-\mathbf{r}_S|} \right] dS. \qquad (12.1)
$$

Figure 12.2 Parameters used for the Kirchhoff-Helmholtz integral (12.1).

Figure 12.2 illustrates the parameters used. In (12.1), $S$ denotes the surface
of an enclosed volume $V$, $(\mathbf{r} - \mathbf{r}_S)$ the vector from a surface point $\mathbf{r}_S$ to
an arbitrary listener position $\mathbf{r}$ within the source-free volume $V$, $P(\mathbf{r}_S,\omega)$
the Fourier transform of the pressure distribution on $S$, and $\mathbf{n}$ the inwards
pointing surface normal. The temporal angular frequency is denoted by $\omega$,
$\beta = \omega/c$ is the wave number, and $c$ the speed of sound. The terms involving
exponential functions describe monopole and dipole sources, i.e. the gradient
term accompanying $P(\mathbf{r}_S,\omega)$ represents a dipole source distribution while the
term accompanying the gradient of $P(\mathbf{r}_S,\omega)$ represents a monopole source
distribution.

The Kirchhoff-Helmholtz integral states that at any listening point within the
source-free volume $V$ the sound pressure $P(\mathbf{r},\omega)$ can be calculated if both the
sound pressure and its gradient are known on the surface enclosing the volume.
Thus a wave field within the volume $V$ can be synthesized by generating the
appropriate pressure distribution $P(\mathbf{r}_S,\omega)$ by dipole source distributions and its
gradient by monopole source distributions on the surface $S$. This interrelation
is used for WFS-based sound reproduction as discussed below.

2.2 WAVE FIELD SYNTHESIS BASED SOUND
REPRODUCTION



Following the above presentation of the physical foundations, a first
introduction to the techniques for WFS-based sound reproduction is given here.

The effect of monopole and dipole distributions described by (12.1) can be 
approximated by appropriately constructed loudspeakers. However, a direct 
realization of the scenario shown in Fig. 12.2 would require a very high number 
of two different types of loudspeakers densely spaced on the entire surface S. 
Since such an approach is impractical, several simplifications are necessary to 
arrive at a realizable system: 

1. Degeneration of the surface $S$ to a plane $S_0$ separating the primary sources
and the listening area.

2. Reformulation of (12.1) in terms of only one kind of source, e.g. either
monopoles only or dipoles only.

3. Degeneration of the plane $S_0$ to a line.

4. Spatial discretization. 



These simplifications are now discussed in more detail. 

The first step maps the volume $V$ to an entire half space such that the
arbitrarily shaped surface $S$ turns into a plane $S_0$. Then the sound field caused by
the monopoles and dipoles is symmetric with respect to the plane $S_0$.

This symmetry is exploited in the second step. Since only the sound field on
one side of $S_0$ is of interest, there is no need to control the sound fields on both
sides independently of each other. Consequently, one kind of source on $S_0$ can
be eliminated. Dropping e.g. the dipoles and keeping the monopoles causes an
even symmetry of the resulting sound field. Thus a considerable simplification
results, since the desired sound field on one side of $S_0$ is generated by monopoles
alone.

The results of steps 1 and 2 for monopoles are described by the Rayleigh I
integral [6] as

$$
P(\mathbf{r},\omega) = \rho c\, \frac{j\beta}{2\pi} \int_{S_0} V_n(\mathbf{r}_S,\omega)\, \frac{e^{-j\beta|\mathbf{r}-\mathbf{r}_S|}}{|\mathbf{r}-\mathbf{r}_S|}\, dS_0, \qquad (12.2)
$$

where $\rho$ denotes the static density of the air, $c$ the speed of sound and $V_n$ the
particle velocity perpendicular to the surface. A similar description for dipoles
only leads to the Rayleigh II integral [6].

The third step is motivated by the requirement to synthesize the wave field
correctly only in the horizontal ear plane of the listener. Deviations from the true
reconstruction well above and below the head of the listeners can be tolerated.
For this scenario the surface $S_0$ further degenerates to a line surrounding the
listening area.

The fourth step is necessary because loudspeakers cannot be placed
infinitely close to each other. Equidistant spatial discretization with space
increment $\Delta\lambda$ gives the discrete monopole positions $\mathbf{r}_i$. Steps 3 and 4 result in
a one-dimensional representation of the Rayleigh I integral (12.2) for discrete
monopole positions [7], i.e.

$$
P(\mathbf{r},\omega) = \sum_i A_i(\omega)\, P(\mathbf{r}_i,\omega)\, \frac{e^{-j\beta|\mathbf{r}-\mathbf{r}_i|}}{|\mathbf{r}-\mathbf{r}_i|}\, \Delta\lambda, \qquad (12.3)
$$

where $A_i(\omega)$ denotes a geometry dependent weighting factor.

Based on the above equation, an arbitrary sound field can be synthesized
within a listening area level with the listeners' ears. The monopole distribution
on the line surrounding the listening area is approximated by an array of closed
loudspeakers mounted at a distance $\Delta\lambda$. In the simplest case, several linear
loudspeaker arrays are placed at appropriate angles to enclose the listening area
fully or partly. Figure 12.7 shows a typical arrangement.

Up to now it has been assumed that no acoustic sources lie inside the volume
$V$. The theory presented above can also be extended to the case where sources
lie inside the volume $V$ [5]. Then acoustical sources can be placed between the
listener and the loudspeakers within the reproduction area (focused sources).
This is not possible with traditional stereo or multichannel setups.

The simplifications described above lead to deviations from the true sound
pressure distribution. In particular, the effects of spatial discretization limit the
performance of real WFS systems. In practice, two effects are observed:

1. Spatial aliasing.

2. Truncation effects.



Spatial aliasing is introduced through the discretization of the Rayleigh
integral due to spatial sampling. The aliasing frequency is given by [7]

$$
f_{\mathrm{al}} = \frac{c}{2\,\Delta\lambda\,\sin\alpha_{\max}}, \qquad (12.4)
$$

where $\alpha_{\max}$ denotes the maximum angle of incidence of the synthesized plane
wave relative to the loudspeaker array. Assuming a loudspeaker spacing
of $\Delta\lambda = 10$ cm, the minimum spatial aliasing frequency is $f_{\mathrm{al}} = 1700$ Hz.
Compared with the standard audio bandwidth of 20 kHz, spatial aliasing seems to be
a problem for practical WFS systems. Fortunately, the human auditory system
seems not to be very sensitive to these aliasing artifacts.
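
As a quick numerical check of the aliasing frequency discussed above, the following Python sketch evaluates c / (2 · spacing · sin α_max) for a few loudspeaker spacings. The function name `aliasing_frequency` and the speed of sound of 340 m/s are assumptions made for illustration; 340 m/s is the value that reproduces the 1700 Hz figure for a 10 cm spacing when α_max = 90°.

```python
import math


def aliasing_frequency(spacing_m, alpha_max_deg=90.0, c=340.0):
    """Spatial aliasing frequency in Hz for a given loudspeaker spacing.

    spacing_m     -- loudspeaker spacing in metres
    alpha_max_deg -- maximum angle of incidence in degrees (assumed in (0, 90])
    c             -- speed of sound in m/s (assumed value)
    """
    sin_alpha = math.sin(math.radians(alpha_max_deg))
    if sin_alpha <= 0.0:
        raise ValueError("alpha_max_deg must lie in (0, 90] degrees")
    return c / (2.0 * spacing_m * sin_alpha)


if __name__ == "__main__":
    # Worst case (alpha_max = 90 degrees) for three spacings.
    for spacing in (0.05, 0.10, 0.20):
        print(f"spacing {spacing:.2f} m: f_al = {aliasing_frequency(spacing):.0f} Hz")
```

Halving the spacing doubles the aliasing frequency, which is why small loudspeakers are attractive for WFS arrays.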

Truncation effects appear when the line is represented by a loudspeaker array
of finite extension. They can be understood as diffraction waves propagating
from the ends of the loudspeaker array. Truncation effects can be minimized,
e.g. by filtering in the spatial domain (tapering, windowing) [7].
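
The tapering mentioned above can be sketched as a spatial window applied to the per-loudspeaker weights. The half-cosine shape, the function name `taper_window` and the parameter `taper_fraction` are illustrative assumptions; the text does not prescribe a particular window.

```python
import numpy as np


def taper_window(num_speakers, taper_fraction=0.2):
    """Spatial weights: 1 in the middle of the array, half-cosine roll-off at both ends.

    taper_fraction is the share of loudspeakers tapered at EACH end (assumed <= 0.5).
    """
    if not 0.0 <= taper_fraction <= 0.5:
        raise ValueError("taper_fraction must lie in [0, 0.5]")
    weights = np.ones(num_speakers)
    n_taper = int(round(taper_fraction * num_speakers))
    if n_taper > 0:
        # Rising half-cosine ramp sampled at bin centres, so no weight is exactly 0.
        ramp = 0.5 * (1.0 - np.cos(np.pi * (np.arange(n_taper) + 0.5) / n_taper))
        weights[:n_taper] = ramp
        weights[num_speakers - n_taper:] = ramp[::-1]
    return weights


if __name__ == "__main__":
    print(np.round(taper_window(20, 0.2), 3))
```

Multiplying each loudspeaker driving signal by its weight lowers the level of the outermost loudspeakers, which is the spatial-domain filtering idea referred to in the text.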

The remaining sections of this chapter show how to implement the physical
processes described by the Rayleigh I integral by discrete-time, discrete-space
algorithms. For the sake of readability of the text, some simplifications of the
notation are adopted. Most relations are given in terms of Fourier
transformations with respect to the continuous time $t$. The corresponding frequency
variable in the Fourier domain is denoted by $\omega$. Discrete-time operations (e.g.
convolutions) are given in terms of the discrete-time variable $k$. Discrete-time
functions (sequences) are distinguished from their continuous-time counterparts
by brackets (e.g. $h[k]$). For simplicity, no corresponding discrete-time Fourier
transformation is introduced. Instead, the Fourier transformation is used. It is
assumed that the reader is familiar with the basic continuous-to-discrete and
time-to-frequency transitions.

3. MODEL-BASED AND DATA-BASED RENDERING 

The Rayleigh I integral (12.3) formulates a mathematical description of the 
wave fields that are synthesized by linear loudspeaker arrays. The loudspeaker 
driving signals can be derived by two different specializations of the Rayleigh 
I integral which result in two rendering techniques suited for different appli- 
cations of WFS systems. These two rendering techniques are described in the 
next sections. 

3.1 DATA-BASED RENDERING 

In (12.3) the expression $e^{-j\beta|\mathbf{r}-\mathbf{r}_i|}/|\mathbf{r}-\mathbf{r}_i|$ describes the sound pressure
propagating from the $i$th loudspeaker to the listener position $\mathbf{r}$. Therefore, the
loudspeaker driving signals $W_i(\omega)$ for arbitrary wave fields can be computed
according to (12.3) as

$$
W_i(\omega) = A_i(\omega)\, P(\mathbf{r}_i,\omega). \qquad (12.5)
$$

The pressure distribution $P(\mathbf{r}_i,\omega)$ contains the entire information of the sound
field produced at the loudspeaker position $\mathbf{r}_i$ by an arbitrary source $q$ (see
Fig. 12.2). The propagation from the source to the loudspeaker position $\mathbf{r}_i$
can be modeled by a transfer function $H_i(\omega)$. By incorporating the weighting
factors $A_i(\omega)$ into $H_i(\omega)$ we can calculate the loudspeaker driving signals as

$$
W_i(\omega) = H_i(\omega)\, Q(\omega). \qquad (12.6)
$$

$Q(\omega)$ denotes the Fourier transform of the source signal $q$. Transforming this
equation back into the discrete-time domain, the vector of loudspeaker driving
signals $\mathbf{w}[k] = [w_1[k], \ldots, w_i[k], \ldots, w_M[k]]^T$ can be derived by a multichannel
convolution of measured or synthesized impulse responses with the source signal
$q[k]$. This reproduction technique is illustrated in Fig. 12.3.

Figure 12.3 Principle of data-based rendering. $h_i[k]$ denotes an appropriate impulse response
that is convolved with the source signal $q[k]$ to derive the $i$th loudspeaker driving signal.

Arranging the discrete-time transforms of $H_i(\omega)$, i.e. the impulse responses
$h_i[k]$, in a vector $\mathbf{h}[k] = [h_1[k], \ldots, h_i[k], \ldots, h_M[k]]^T$, the loudspeaker driving signals
can be expressed as

$$
\mathbf{w}[k] = \mathbf{h}[k] * q[k], \qquad (12.7)
$$

where the symbol * denotes the time-domain convolution operator. 
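
The multichannel convolution of data-based rendering can be sketched in a few lines of Python: each loudspeaker driving signal is the source signal convolved with that loudspeaker's impulse response. The function name `render_data_based` and the array layout (one row per loudspeaker) are assumptions for illustration, and the direct time-domain convolution is chosen for clarity rather than speed.

```python
import numpy as np


def render_data_based(q, h):
    """Convolve one source signal with M impulse responses.

    q -- source signal, shape (K,)
    h -- impulse responses, shape (M, N), one row per loudspeaker
    Returns the driving signals, shape (M, K + N - 1).
    """
    q = np.asarray(q, dtype=float)
    h = np.asarray(h, dtype=float)
    # One full linear convolution per loudspeaker channel.
    return np.stack([np.convolve(h_i, q) for h_i in h])


if __name__ == "__main__":
    q = np.array([1.0, 2.0, 3.0])
    h = np.array([[1.0, 0.0, 0.0],    # loudspeaker 1: pass-through
                  [0.0, 0.5, 0.0]])   # loudspeaker 2: one-sample delay, gain 0.5
    print(render_data_based(q, h))
```

The cost grows with the number of loudspeakers times the impulse response length, which is the computational load discussed in the text; long room impulse responses would in practice call for FFT-based fast convolution.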

The impulse responses for auralization cannot be obtained the conventional 
way by simply measuring the response from a source to a loudspeaker posi- 
tion. In addition to the sound pressure also the particle velocity is required to 
extract the directional information. This information is necessary to take into 
account the direction of the traveling waves for auralization. The room impulse 
responses have to be recorded by special microphone setups and extrapolated 
to the loudspeaker positions as shown in Sec. 4. 

Data-based rendering allows for a very realistic reproduction of the acoustics 
of the recording room. The main drawback of this reproduction technique is 
the high computational load caused by the multichannel convolution (12.7), 
especially when using long room impulse responses. Data-based rendering is 
typically used for high-quality reproduction of static scenes. 

3.2 MODEL-BASED RENDERING 

Model-based rendering may be interpreted as a special case of the data-based 
approach: Instead of measuring room impulse responses, models for acoustic 
sources are used to calculate the loudspeaker driving signals. Point sources and 
plane waves are the most common models used in model-based WFS systems. 

Based on these models, the transfer functions $H_i(\omega)$ in (12.6) can be obtained
in closed form. For a point source, $H_i(\omega)$ can be derived as [7]

$$
H_i(\omega) = K \sqrt{\frac{j\beta}{2\pi}}\, \frac{e^{-j\beta|\mathbf{r}_i-\mathbf{r}_q|}}{|\mathbf{r}_i-\mathbf{r}_q|^{3/2}}, \qquad (12.8)
$$

where $K$ is a geometry dependent constant and $\mathbf{r}_q$ denotes the position of the point
source. Transforming $W_i(\omega)$ back into the time domain and employing time
discretization, the loudspeaker driving signals can be computed from the source
signal $q[k]$ by delaying, weighting and filtering, i.e.

$$
w_i[k] = c_i \left( g[k] * \delta[k-\kappa] \right) * q[k], \qquad (12.9)
$$

where $c_i$ and $\kappa$ denote an appropriate weighting factor and delay, respectively,
and $g[k]$ is the inverse Fourier transform of $\sqrt{j\beta/2\pi}$. Multiple (point) sources
can be synthesized by superimposing the loudspeaker signals for each source.
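
The delaying and weighting steps for a virtual point source can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [7]: the function name `point_source_driving_signals`, the integer-sample rounding of the delays, the speed of sound of 340 m/s and the amplitude decay exponent `decay_exponent` are all assumed, and the filter g[k], which is the same for every loudspeaker, is left out.

```python
import numpy as np


def point_source_driving_signals(q, speaker_xy, source_xy, fs,
                                 c=340.0, decay_exponent=1.5):
    """Delay and weight a source signal for each loudspeaker.

    q          -- source signal, shape (K,)
    speaker_xy -- loudspeaker positions in metres, shape (M, 2)
    source_xy  -- virtual point source position in metres, shape (2,)
    fs         -- sampling rate in Hz
    Returns (signals, delays, weights); signals has shape (M, K + max delay).
    """
    q = np.asarray(q, dtype=float)
    speaker_xy = np.asarray(speaker_xy, dtype=float)
    dist = np.linalg.norm(speaker_xy - np.asarray(source_xy, dtype=float), axis=1)
    delays = np.round(dist / c * fs).astype(int)   # per-loudspeaker delay in samples
    weights = dist ** (-decay_exponent)            # weights up to a common constant (assumed law)
    signals = np.zeros((len(dist), len(q) + delays.max()))
    for i, (kappa, c_i) in enumerate(zip(delays, weights)):
        signals[i, kappa:kappa + len(q)] = c_i * q
    # The common filter g[k] is omitted in this sketch.
    return signals, delays, weights


if __name__ == "__main__":
    speakers = [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
    sig, delays, weights = point_source_driving_signals(
        [1.0], speakers, [0.0, -1.0], fs=34000.0)
    print(delays, np.round(weights, 3))
```

Loudspeakers farther from the virtual source receive a later and weaker copy of the signal, so the emitted wave front is curved as if it originated at the source position. Several sources are handled by summing the resulting loudspeaker signals.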

Similar models exist for plane wave sources. Plane waves and point sources 
can be used to simulate classical loudspeaker setups, like stereo and 5-channel 
systems. Thus, WFS is backward compatible to existing sound reproduction 
systems. It can even improve them by optimal setup of the virtual loudspeakers 
in small listening rooms. 

Model-based rendering results in lower computational load compared to
data-based rendering because convolutions with long room impulse responses
are not required. Rendering of moving sound sources does not call for
significant additional effort, and reproduction of virtual sources in front of
the loudspeaker array (i.e. inside the listening area) is also possible. The
characteristics of the recording room can be simulated by placing additional point
sources that model acoustic reflections (e.g. image method [15]). The
model-based approach is especially interesting for synthetic or parameterized acoustic
scenes.

3.3 HYBRID APPROACH 

For scenes with long reverberation time, data-based WFS systems require a 
lot of computational resources. In such applications a hybrid approach can be 
used that achieves similar sound quality with reduced computational complex- 
ity. 

Here, the direct sound is reproduced by a virtual point source located at the 
desired source position. The reverberation is reproduced by synthesizing several 
plane waves from various directions to obtain a perceptually diffuse reverberant 
sound field. The computationally demanding convolution has to be performed 
only for each plane wave direction instead of each individual loudspeaker as 
would be necessary for data-based WFS systems. Some realizations of WFS 
systems successfully synthesize the reverberation with about eight plane waves 
[8, 16]. 

4. WAVE FIELD ANALYSIS 

Data-based rendering as introduced in the previous section requires
knowledge of the impulse responses $\mathbf{h}[k]$ needed for auralization. In
principle it is possible to measure the impulse responses by placing microphones at
the loudspeaker positions in the scene to be recorded and measuring the impulse
responses from the source(s) to the microphones [14]. This approach has one
major drawback: The recorded scene dataset is limited to one particular loud- 
speaker setup. A higher degree of flexibility could be obtained by analyzing 
the acoustic wave field inside the entire area of interest, which could then be 
reproduced by arbitrary loudspeaker setups. Wave field analysis techniques are 
used to analyze the acoustic wave field for a region of interest. The following 
section introduces the wave field analysis techniques used in the context of 
WFS. These techniques are mainly derived from seismic wave theory [17]. 

In general, there will be no access to the entire two-dimensional pressure
field $P(\mathbf{r},\omega)$ for analysis. However, the Kirchhoff-Helmholtz integral (12.1)
allows the acoustic pressure field $P(\mathbf{r},\omega)$ to be calculated from the sound pressure
and its gradient on the surface enclosing the desired field and vice versa (see
Section 2.1). Therefore, the Kirchhoff-Helmholtz integral (12.1) can not only
be used to introduce the concept of wave field synthesis, but also to derive 
an efficient and practical way to analyze wave fields. We will introduce the 
analysis tools for two-dimensional wave fields in the following. This requires 
a two-dimensional formulation of the Kirchhoff-Helmholtz integral, as given 
e.g. in [17]. As a result, it is sufficient to measure the acoustic pressure and 
its gradient on a line surrounding the area of interest in order to capture the 
acoustics of the entire area. 

To proceed with the analysis of wave fields, a closer look at the acoustic 
wave equation is required. The eigensolutions of the acoustic wave equation 
in three dimensional space appear in different forms, depending on the type of 
the adopted coordinate system. In a spherical coordinate system the simplest
solutions of the wave equation are spherical waves, while plane waves are a
simple solution in Cartesian coordinates. Of course, plane waves can be expressed
as a superposition of spherical waves (see Huygens’ principle) and vice versa 
[13]. Accordingly, two major types of transformations exist for three dimen- 
sional wave fields, the decomposition of the field into spherical harmonics or 
into plane waves. The decomposition into spherical harmonics is used in the 
Ambisonics system [2] whereas the decomposition into plane waves is often 
utilized for WFS. We will introduce the decomposition into plane waves in the 
following. To simplify matters in the discussion it is assumed that the region 
of interest is source free. See [18] for a more general view. 

The next step is to transform the pressure field $p(\mathbf{r}, t)$ into plane waves
with the incident angle $\theta$ and the temporal offset $\tau$ with respect to an arbitrary
reference point. This way any recorded data $p(\mathbf{r}, t)$ become independent of
the geometry used for recording. This transformation is called plane wave
decomposition and is given by

$$
\tilde{p}(\theta,\tau) = \mathcal{P}\{p(\mathbf{r},t)\}, \qquad (12.10)
$$

where $\tilde{p}(\theta,\tau)$ denotes the plane wave decomposed wave field and $\mathcal{P}$ the plane
wave decomposition operator. The plane wave decomposition maps the
pressure field into an angle-offset domain. This transformation is also well known
as the Radon transformation [19] from image processing. The Radon
transformation maps straight lines in the image domain into Dirac peaks in the Radon
domain. It is therefore typically used for edge detection in digital image
processing. The same principle applies to acoustic fields: a plane wave can be
understood as an 'edge' in the pressure field $p(\mathbf{r}, t)$.
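
The idea of mapping a plane wave to a peak in the angle-offset domain can be illustrated with a discrete slant stack (delay-and-sum) over a linear microphone array. This is a simplified sketch under stated assumptions, not the circular-array method of [18]: it assumes microphones on a straight line, far-field plane waves, integer-sample delays, zero-padded signals (so that the circular shift used below does not wrap signal energy around) and a speed of sound of 340 m/s. The function name `slant_stack` and the angle convention (a wave from angle θ reaches a microphone at position x with delay x·sin θ / c) are choices made for this example.

```python
import numpy as np


def slant_stack(p, mic_x, angles_deg, fs, c=340.0):
    """Map microphone signals p[m, k] to an (angle, offset) representation.

    p          -- pressure signals, shape (M, K), zero-padded at both ends
    mic_x      -- microphone positions along the array in metres, shape (M,)
    angles_deg -- candidate plane wave angles in degrees
    fs         -- sampling rate in Hz
    Returns an array of shape (len(angles_deg), K).
    """
    p = np.asarray(p, dtype=float)
    mic_x = np.asarray(mic_x, dtype=float)
    out = np.zeros((len(angles_deg), p.shape[1]))
    for a, theta in enumerate(np.radians(angles_deg)):
        # Expected arrival delay of a plane wave from angle theta at each microphone.
        shifts = np.round(mic_x * np.sin(theta) / c * fs).astype(int)
        for m, s in enumerate(shifts):
            # Undo the expected delay (circular shift) and accumulate.
            out[a] += np.roll(p[m], -s)
    return out
```

For a single plane wave, the stacked signals line up only at the matching angle, so the output shows one pronounced peak there and low, spread-out values at the other angles.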

One of the benefits of using a spatial transform of the wave field is that plane
waves can be extrapolated easily to other positions [17]

P{r,0,aj) = ^ (12.11) 

where $r$ and $\theta$ denote the position in cylindrical coordinates with respect to the
origin of the plane wave decomposition. This allows, in principle, a recorded
wave field to be extrapolated to arbitrary points without loss of information. Wave field
extrapolation can be used to extrapolate a measured field to the loudspeaker 
positions for reproduction purposes or to create a complete image of the captured 
sound field. The impulse responses required for data-based rendering (see 
Section 3.1) can be obtained by measuring the plane wave decomposed impulse 
response of the wave field and extrapolation to the loudspeaker positions. 

The main benefit of the plane wave decomposition is that it is possible to 
capture the acoustical characteristics of a whole area through a measurement 
on the boundary of this area. Plane wave decomposed impulse responses are 
therefore an efficient tool to describe the acoustical properties of a whole area 
of interest. 

For practical reasons the acoustic pressure and its gradient will only be mea- 
sured on a limited number of positions on the circle. This spatial sampling can 
result in spatial aliasing in the plane wave decomposed field if the sampling the- 
orem is not taken into account. An efficient implementation of the plane wave 
decomposition for circular microphone arrays can be found in [18]. Pressure 
and pressure gradient microphones placed on a circle surrounding the area of 
interest are used, together with the duality between plane waves and cylindrical 
harmonics for this realization. 

5. LOUDSPEAKER AND LISTENING ROOM 
COMPENSATION 

The theory of WFS systems as described in Section 2 was derived assuming
free field propagation of the sound emitted by the loudspeakers. However, real
listening rooms do not exhibit free field propagation nor do real loudspeakers
act like point sources. Therefore, the compensation of the non-ideal effects of
listening rooms and loudspeakers is required to arrive at practical WFS systems.

Acoustic reflections off the walls of the listening room can degrade the sound
quality, especially the perceptibility of the spatial properties of the auralized
acoustic scene. Listening room compensation aims at improving the perceived
quality in sound reproduction in non-anechoic environments.

Additionally, the theory of WFS assumes that the secondary sources act like
ideal monopole sources. Real loudspeakers, however, deviate more or less
from this idealized assumption. They show deviations from the monopole
characteristic as well as deviations from an ideally flat frequency response. This
holds especially for the loudspeakers used for WFS systems because small
dimensions are mandatory in order to arrive at a reasonable aliasing frequency (see
Section 2.2). Loudspeaker compensation aims at reducing the effects caused
by the non-ideal characteristics of the loudspeakers used for reproduction. The
next sections introduce possible solutions to room and loudspeaker
compensation.

5.1 LISTENING ROOM COMPENSATION 

When listening to a recording that contains the reverberation of the recorded 
scene, the reverberation caused by the listening room interferes with the 
recorded sound field in a potentially negative way [20]. Figure 12.4 shows 
the effect of the listening room on the auralized wave field. The WFS system 
in the listening room composes the recorded acoustic scene by positioning the 
dry source signals at their respective positions relative to the recording room 
together with the acoustic effect of the recording room. The size and position of 
the recording room relative to the listening room is shown by a dashed line for 
reference. The dashed lines from one virtual source to one exemplary listening 
position show the acoustic rays for the direct sound and one reflection off the 
side wall of the virtual recording room. The solid line from one loudspeaker to 
the listening position shows a possible reflection of the loudspeaker signal off 
the wall of the listening room. This simple example illustrates the effect of the 
listening room on the auralized wave field. Perfect listening room compensation 
would entirely eliminate the effects caused by the listening room. 

Room compensation can be realized by calculating a set of suitable com- 
pensation filters to pre-filter the loudspeaker signals. Unfortunately there are 
a number of pitfalls when designing a room compensation system. A good 
overview of classical room compensation approaches and their limitations can 
be found in [21]. Most of the classical approaches fail to provide room com- 
pensation for a large listening area [22, 23]. There are three main reasons for 
this failure: problems involved with the calculation of the compensation filters, 
lack of directional information for the wavefronts and lack of control over the 
wave field inside the listening area. Most classical room compensation systems
measure the impulse responses from one or more loudspeakers to one or more
microphone positions. Unfortunately, typical room impulse responses are, in
general, non-minimum phase [24], which prevents the calculation of exact inverses
for each measured impulse response. The measurements often involve only
the acoustic pressure at the point where they were captured. Without directional
information on the traveling wave fronts a room compensation system will fail
to determine which loudspeakers to use for compensation. Additionally, most
classical systems do not provide sufficient control over the wave field. They
utilize only a limited number of loudspeakers. WFS may provide a solution
for the last two problems concerning the lack of directional information for the
traveling wave fronts and lack of control over the wave field.

Figure 12.4 Simplified example that shows the effect of the listening room on the auralized
wave field. The dashed lines from one virtual source to one exemplary listening position show
the acoustic rays for the direct sound and one reflection off the side wall of the virtual recording
room. The solid line from one loudspeaker to the listening position shows a possible reflection
of the loudspeaker signal off the wall of the listening room.

5.1.1 Room Compensation for Wave Field Synthesis. As wave field
synthesis in principle allows the wave field within the listening area to be
controlled, it can also be used to compensate for the reflections caused by the
listening room. Of course, this is only valid up to the spatial aliasing frequency (12.4)
of the particular WFS system used. Above the aliasing frequency no
control over the wave field is possible. Destructive interference to compensate
for the listening room reflections will fail here. Some indications for what can
be done above the aliasing frequency will be given in Section 5.1.2.

Figure 12.5 Block diagram of a WFS system including the influence of the listening room and
the compensation filters to cope with the listening room reflections.

In the following a possible solution to room compensation with WFS will
be introduced; more details can be found in [9, 11, 12]. Figure 12.5 shows
the signal flow diagram of a room compensation system utilizing WFS. The
primary source signal $q[k]$ is first fed through the WFS system operator $\mathbf{h}[k]$
(see Section 3). The influence of the listening room is compensated by
compensation filters $\mathbf{C}[k]$. The resulting signals are then played back through the
$M$ loudspeakers of the WFS system. The loudspeaker and listening room
characteristics are contained in the matrix $\mathbf{R}[k]$. Instead of using the microphone
signals directly we perform a plane wave decomposition of the measured wave
field into $L$ plane waves as described in Section 4. The transformed wave field is
denoted as $\tilde{\mathbf{R}}$. Using the plane wave decomposed wave fields instead of the
microphone signals has the advantage that the complete spatial information about
the listening room influence is included. This allows the calculation of compensation
filters which are in principle valid for the complete area inside the loudspeaker
array, except for secondary effects not covered here that limit the performance
[25].

The auralized wave field inside the listening area is expressed by the vector
$\mathbf{l}[k]$. According to Fig. 12.5, the auralized wave field $\mathbf{L}(\omega)$ is given as

$$
\mathbf{L}(\omega) = \tilde{\mathbf{R}}(\omega) \cdot \mathbf{C}(\omega) \cdot \mathbf{H}(\omega) \cdot Q(\omega). \qquad (12.12)
$$

In principle, perfect listening room compensation would be obtained if

$$
\tilde{\mathbf{R}}(\omega) \cdot \mathbf{C}(\omega) = \mathbf{F}(\omega), \qquad (12.13)
$$

where F denotes the plane wave decomposed wave field of each loudspeaker 
under free field conditions. 

Using the multiple-input/output inversion theorem (MINT) it is possible to
solve the above equation exactly under certain realistic conditions, as shown
in [26]. However, many applications only require compensation of the first
reflections of the room. Using a least-squares error (LSE) method to calculate
the compensation filters results in approximate inverse filters that are
shorter than those obtained with the MINT.
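
As a rough illustration of the least-squares idea, the following sketch solves the compensation condition R C = F independently in each frequency bin with a Tikhonov-regularized least-squares fit. This is a simplified frequency-domain stand-in and not the procedure of [26] or of the cited WFS literature; the function name, the array layout (bins x plane waves x loudspeakers) and the regularization constant `reg` are assumptions made for the example.

```python
import numpy as np


def lse_compensation_filters(R, F, reg=1e-3):
    """Regularized least-squares solution of R[b] @ C[b] = F[b] per bin b.

    R, F : complex arrays of shape (n_bins, L, M), with L plane-wave
           components and M loudspeakers (hypothetical layout).
    reg  : Tikhonov regularization weight (assumed value).
    Returns C with shape (n_bins, M, M).
    """
    n_bins, _, M = R.shape
    C = np.empty((n_bins, M, M), dtype=complex)
    eye = np.eye(M)
    for b in range(n_bins):
        RH = R[b].conj().T
        # Normal equations with regularization: (R^H R + reg I) C = R^H F
        C[b] = np.linalg.solve(RH @ R[b] + reg * eye, RH @ F[b])
    return C
```

A larger `reg` trades compensation accuracy for filters with less gain, which is one simple way to keep the inverse filters well behaved when R is badly conditioned.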

An efficient implementation of room compensation can be derived by
integrating the WFS operator into the room compensation filters. Off-line
calculation of $C(\omega) \cdot H(\omega)$ results in only M filters that have to be
applied in real time to the source signal. This is significantly fewer than in
the approach presented above.
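
The saving can be sketched as follows, assuming the compensation filters and the WFS operator are available as sampled frequency responses. The names `combine_filters` and `render`, the array shapes, and the single-block FFT rendering are assumptions for illustration only; a real-time system would use a partitioned or overlap-add convolution instead.

```python
import numpy as np


def combine_filters(C, H):
    """Off-line step: G[b] = C[b] @ H[b] for every frequency bin b.

    C : (n_bins, M, M) compensation filters, H : (n_bins, M) WFS operator.
    Returns G of shape (n_bins, M), i.e. one filter per loudspeaker.
    """
    return np.einsum("bij,bj->bi", C, H)


def render(G, q, n_fft):
    """Real-time step: apply the M combined filters to the source signal q.

    n_fft must be at least len(q) plus the filter length minus one,
    otherwise the FFT product wraps around (circular convolution).
    Returns an array of shape (n_fft, M) with the loudspeaker signals.
    """
    Q = np.fft.rfft(q, n_fft)
    return np.fft.irfft(G * Q[:, None], n_fft, axis=0)
```

With the combined filters, only M convolutions are applied to each source sample block, instead of M convolutions for the WFS operator followed by M x M convolutions for the compensation filters.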

5.1.2 Compensation Above the Aliasing Frequency. Above the aliasing
frequency of the WFS system, no destructive interference can be used to
compensate for the reflections caused by the listening room. Perfect
compensation in the sense of a resulting free field wave propagation for each
loudspeaker is therefore not possible above the aliasing frequency. A possible
approach is to exploit psychoacoustic properties of the human auditory system
to mask listening room reflections, as indicated in [9].

5.2 LOUDSPEAKER COMPENSATION 

Typically, two different types of loudspeakers are used to build arrays for
WFS. The first type is the classical cone loudspeaker and the second is the
multiexciter panel.

Cone loudspeakers as used for WFS suffer from the small dimensions required
for a reasonable aliasing frequency of the WFS system. Broadband cone
loudspeakers normally show a narrowed radiation pattern on the main axis at
high frequencies [5] and a non-ideal frequency response. Most of these
problems can be solved to some extent by using high-quality (two-way)
loudspeakers.

For the second type of loudspeakers used for WFS systems, multiexciter
panels (MEP) [27], these non-ideal effects are much more severe. MEPs have
a relatively simple mechanical construction. They consist of a foam board with 
both sides covered by a thin plastic layer. Multiple electro-mechanical exciters 
are glued equally spaced on the backside of the board. The board is fixed on 
the sides and placed in a damped housing. Because of this simple mechani- 
cal construction their frequency response is quite disturbed by resonances and 
notches. The benefits of MEPs are the simple construction and the possibility 
of seamless integration into walls. 

Figure 12.6 shows the magnitude of the measured on-axis frequency responses
of four exciter positions on an MEP prototype. As can be seen, the measured
frequency responses not only differ significantly from an ideally flat
frequency response, they also depend on the exciter position. Multichannel
loudspeaker compensation is therefore mandatory when using MEPs for
auralization purposes.

Figure 12.6 On-axis magnitude frequency responses of four exciter positions on a multiexciter
panel (MEP).

Loudspeaker compensation can be seen as a subset of room compensation
in our framework. By measuring the 'room' impulse responses in an anechoic
chamber, the loudspeaker characteristics can be captured and compensated in
the same way as described in Section 5.1. The compensation filters include the
compensation of the radiation pattern using the neighboring loudspeakers as
well as the compensation of the frequency response and of the reflections
caused by the panel itself.

However, this direct approach has one major drawback: the inverse filters
compensate potentially deep dips in the frequency response with high-energy
peaks at some frequencies. This can lead to clipping of the amplifiers or
overload of the electro-mechanical system of the exciters. This problem can be
solved by smoothing the magnitude of the recorded frequency responses.
Additionally, the length of the inverse filters can be reduced by splitting the
impulse responses into their non-minimum phase and minimum phase parts and
using only the minimum phase part for the inversion process.
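
A minimal sketch of these two measures is given below, under several assumptions that are not taken from the text: the smoothing is a plain moving average over frequency bins (fractional-octave smoothing is more common in practice), the boost is additionally limited to `max_gain_db` relative to the inverse of the response peak, and the minimum phase part is obtained with the real-cepstrum (homomorphic) method. All function and parameter names are hypothetical.

```python
import numpy as np


def smooth_magnitude(mag, width=9):
    """Moving-average smoothing over frequency bins (width must be odd)."""
    if width % 2 == 0:
        raise ValueError("width must be odd")
    if width == 1:
        return mag.copy()
    padded = np.pad(mag, width // 2, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")


def minimum_phase_spectrum(mag_full):
    """Minimum-phase spectrum with the given magnitude (even FFT length)."""
    n = len(mag_full)
    cepstrum = np.fft.ifft(np.log(mag_full)).real
    fold = np.zeros(n)
    fold[0] = 1.0            # keep the zeroth cepstral coefficient
    fold[1:n // 2] = 2.0     # fold the anti-causal part onto the causal part
    fold[n // 2] = 1.0
    return np.exp(np.fft.fft(cepstrum * fold))


def min_phase_inverse(h, n_fft=1024, width=9, max_gain_db=20.0):
    """Inverse filter for the smoothed, minimum-phase part of h."""
    half = smooth_magnitude(np.abs(np.fft.rfft(h, n_fft)), width)
    # Limit the boost so that deep notches are not fully inverted.
    floor = half.max() * 10.0 ** (-max_gain_db / 20.0)
    inv_half = 1.0 / np.maximum(half, floor)
    # Mirror the half spectrum to a full, symmetric magnitude spectrum.
    inv_full = np.concatenate([inv_half, inv_half[-2:0:-1]])
    return np.fft.ifft(minimum_phase_spectrum(inv_full)).real
```

For a response that is already minimum phase and has no deep notches, the result approaches the exact inverse; for a response with a notch, the gain limit caps the peak of the inverse filter and thereby the load on amplifiers and exciters.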

As for listening room compensation, these approaches are only valid up to the
aliasing frequency of the WFS system. Above the aliasing frequency, no control
over the produced wave field is possible. Only individual compensation
of the frequency response of each exciter can be performed there. A complete
algorithm using the techniques described above is presented, e.g., in [10].

6. DESCRIPTION OF A SOUND FIELD 
TRANSMISSION SYSTEM 

Figure 12.7 shows a typical WFS-based sound field recording and reproduc- 
tion system. The recording room contains equipment for dry sound recording 
of the primary sound sources, their positions, and room impulse responses. As 
stated earlier, these impulse responses, in general, cannot be obtained by ap- 
plying conventional room impulse response measurement procedures but need 
to be retrieved by using the techniques as described in Section 4. 

Returning to Fig. 12.7, the size and position of the listening room relative 
to the recording room is shown by dashed lines. The loudspeaker array in 
the listening room is depicted schematically as a closed line surrounding the 
listening area (highlighted in gray). Note that, in this context, any setup of 
linear loudspeaker arrays can be utilized for sound field reproduction, as long 
as the inter-loudspeaker spacing remains constant. 

The WFS system in the listening room composes an acoustic scene by posi- 
tioning the dry source signals at their respective positions relative to the record- 
ing room and by taking into account the acoustic properties of the room to 
be auralized. The virtual sources created by this process may lie outside the 
physical dimensions of the listening room. This creates the impression of an 
enlarged acoustic space. 

The following sections briefly detail typical realizations for the recording 
and reproduction of sound fields. 

6.1 ACQUISITION OF SOURCE SIGNALS 

Up to this point, one link in the chain that leads to realistic sound stage 
reproduction using wave field synthesis has been silently ignored, namely the 
acquisition of the source signals. 

Flexible WFS-based rendering requires each source to be recorded in a way 
such that the influence of the room acoustics and other possibly competing sound 
sources, i.e. crosstalk between the acquired audio channels, is minimized. This 
concept is required for taking advantage of one of the central properties of 
WFS-based rendering, which is the ability to realistically auralize an acoustic 
scene in a different acoustic environment than it was originally recorded in. 

For traditional multichannel recordings, there exists a vast variety of well
established main- and multi-microphone techniques that are utilized by
Tonmeisters for high-quality sound acquisition, e.g. [28]. In the WFS context,
however, one of the main disadvantages of many traditional recording techniques
is the natural constraint on spatial resolution, which may limit WFS performance.

Figure 12.7 A WFS-based sound field recording and reproduction system.

To combat this drawback, microphone array technology and associated signal
processing can be used for obtaining enhanced spatial selectivity. Especially
during live performances, microphone arrays have the further potential
advantage of relieving the actors/musicians of having to carry or wear close-up
microphones. Microphone arrays may also be able to add a new level of
flexibility and creativity to the post-production stage by means of digital
beamforming.

Ideally, for high-quality WFS-based applications, a general-purpose micro- 
phone array-based sound recording system should meet several requirements 
including 

■ high and flexible spatial selectivity, 

■ very low distortion and self noise, 

■ immunity against undesired ambient noise and reverberation, 

■ low-delay and real-time operation, 

■ frequency-independent beampattern, and 

■ sonic quality comparable to studio microphones. 

Unfortunately, the third item describes an as yet unsolved problem. However,
promising microphone array designs have been proposed that meet many of the
other requirements (see e.g. Chap. 3 in this book, or [29]), and can therefore be
utilized as a recording front-end for WFS-based systems in designated recording
environments.

Figure 12.8 Block diagram of a WFS system for reproduction of N sources using M
loudspeakers. Depending on the application, M may range from several tens to
several hundred loudspeakers.

As indicated in Fig. 12.7, WFS-based reproduction systems need to acquire 
information about the source positions for spatially correct and realistic sound 
stage reproduction. Again, microphone array systems implementing advanced 
acoustic source localization algorithms can be of benefit for this task. The 
interested reader is referred to Chapters 8 and 9 in this book. 

6.2 SOUND STAGE REPRODUCTION USING WAVE
FIELD SYNTHESIS

In an actual implementation of a WFS reproduction system, the loudspeaker
driving signals of the array, w[k], are synthesized from the source signals
$q[k] = [q_1[k], \ldots, q_n[k], \ldots, q_N[k]]^T$ by convolution operations
with a set of filters $h_{m,n}[k]$ as shown in Fig. 12.8. Note that this figure
depicts a multi-source scenario, in contrast to the theory presented in the
previous sections, where a single source has been assumed for auralization.
The multi-source scenario can be derived from the single-source case by simply
applying the superposition principle.
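
The multiple-input, multiple-output convolution can be sketched directly from this description: each loudspeaker signal is the sum, over all sources, of the source signal convolved with the filter for that source and loudspeaker pair. The function name and the array layout below are assumptions for illustration, and the direct time-domain convolution is chosen for clarity rather than efficiency.

```python
import numpy as np


def wfs_driving_signals(q, h):
    """Loudspeaker driving signals by superposition of filtered sources.

    q : (N, K) array with N source signals of K samples each.
    h : (M, N, P) array; h[m, n] is the P-tap filter from source n
        to loudspeaker m.
    Returns w of shape (M, K + P - 1).
    """
    N, K = q.shape
    M, N_h, P = h.shape
    if N != N_h:
        raise ValueError("q and h disagree on the number of sources")
    w = np.zeros((M, K + P - 1))
    for m in range(M):
        for n in range(N):
            # Superposition principle: contributions of all sources add up.
            w[m] += np.convolve(q[n], h[m, n])
    return w
```

The cost grows with the product of N and M, which is why block-based FFT convolution is usually preferred once the number of channels becomes large.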

Depending on the application, the filters not only include source positions
and the acoustics of the recording room, which have been acquired during
the recording stage (see Fig. 12.7), but also loudspeaker and listening room
compensation filters. In the simplest case of model-based rendering, the filter
coefficients $h_{m,n}[k]$ can be directly derived from (12.9).

A WFS-based reproduction system needs to process and reproduce, often in 
real-time, a high number of audio channels simultaneously. Often, the focus is 
on a PC-based implementation in contrast to DSP-based (embedded) solutions, 
because PCs potentially preserve a high degree of freedom for development 
platforms. 

On a hardware level, a typical WFS system consists of one or more PCs per- 
forming all the required digital signal processing, multichannel digital audio 
interfaces, D/A-converters, amplifiers, and loudspeaker systems. State-of-the- 
art PCs nowadays allow for real-time digital signal processing of a high number 
of audio channels. Standard operating systems and high-level programming 
languages enable rapid implementation of the algorithms as well as graphical 
user interfaces. All the necessary hardware components are, in principle, com- 
mercially available at the time this book was printed. The design of special 
purpose hardware for WFS systems has the benefit of higher integration, more 
specialized functionality and lower cost. An example could be the design of 
multichannel power amplifiers with integrated D/A converters and digital signal 
processors. 

7. SUMMARY 

The wave field synthesis technology described in this chapter is capable of re- 
producing acoustic scenes for communication and entertainment applications. 
It is based on the basic laws of acoustical wave propagation. The required 
transition from the physical description to a discrete-time, discrete-space re- 
alization with current multichannel audio technology is well understood. The 
main components of this transition can be described in terms of geometri- 
cal considerations, spatial transformations, temporal and spatial sampling, and 
multichannel equalization. 

The basic signal processing structure in the form of a multiple-input, 
multiple-output convolution allows a unified system implementation, even if 
different kinds of rendering approaches are used. This unified structure for the
implementation on the one hand and the flexibility and adaptability to different
applications on the other hand make wave field synthesis a suitable technology
for sound field rendering.

Abstract Future tele-collaboration systems need to support seamless spatial sound
reproduction to achieve realistic immersion. In this chapter, we present techniques
based on binaural signal processing, capable of encoding and rendering sound
sources accurately in three-dimensional space using only two recording/playback
channels, thus creating the illusion of a virtual source in space. The technique
exploits the mechanisms that help to decode spatial information in the auditory
system. An overview of the encoding and decoding mechanisms is given and
their practical application to the design of virtual spatial sound (VSS) systems is
discussed.

Keywords: Binaural Hearing, Binaural Technology, Virtual Sound, Spatial Sound, HRTF,
HRIR, Crosstalk Cancellation



1. INTRODUCTION 

In this chapter, we are concerned with the rendering of sound sources in three- 
dimensional space using only two playback channels. We focus our study on 
a family of techniques based on the synthesis of binaural signals. Given that 
the human hearing mechanism is binaural, all spatial information available 
to the listener is encoded within two acoustic signals (i.e. one to each ear). 
Thus in principle, only two playback channels are necessary and sufficient to 
realistically render sound sources localized at any point in space. Since the 
sound is not perceived as coming from any given loudspeaker the technique 
creates the illusion of a virtual sound source in free space. 

There are many ways to create the impression that a sound source is located at 
a given point in space away from the loudspeakers. One familiar example is the 
playback over two speakers commonly known as stereo reproduction, where a 
sound source can be positioned within the angle subtended by the loudspeakers 
by simply manipulating the gains assigned to the source in each channel (i.e.
amplitude panning). If the desired rendering position of a source is outside
of this region, then the angle between speakers needs to be increased or more 
rendering channels (and corresponding gain reassignments) have to be added. 
The success of amplitude panning depends highly on the location of the listener 
with respect to the loudspeakers. The location where amplitude panning (or any 
other playback technique) is most effective is commonly known as the sweet 
spot. Another technique, which renders virtual spatial sound accurately and has 
a large sweet spot, but which requires multiple playback channels is wavefield 
synthesis. This technique is described in Chapter 12. 
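
As a small illustration of amplitude panning between two loudspeakers, the sketch below uses the constant-power (sine/cosine) panning law. The chapter does not prescribe a particular law, so this choice, the function name, and the normalized position parameter are assumptions.

```python
import math


def constant_power_pan(position):
    """Gains for two-channel amplitude panning.

    position : -1.0 (fully left) to +1.0 (fully right), 0.0 is the center.
    Returns (g_left, g_right) with g_left**2 + g_right**2 == 1, so the
    total radiated power stays constant while the source is moved.
    """
    if not -1.0 <= position <= 1.0:
        raise ValueError("position must lie in [-1, 1]")
    angle = (position + 1.0) * math.pi / 4.0  # maps [-1, 1] to [0, pi/2]
    return math.cos(angle), math.sin(angle)
```

The panned channels are obtained by multiplying the source signal by the two gains; a centered source receives a gain of about 0.707 in each channel.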

The advantage of binaural systems over other techniques is that only two 
playback channels are required, and a virtual sound source can be positioned at 
any desired point in space with great accuracy. This advantage does not come 
without difficulties. The binaural signal has to be delivered to the listener’s 
ears with great fidelity to be realistic. Generally headphones are used to render 
binaural signals, but in some applications this is not practical. As discussed 
later, techniques using standard stereo loudspeaker setups also exist, although 
their effectiveness is confined to very small sweet spots. 

Another problem in binaural audio is the idiosyncratic nature of the binaural 
signal. Sound is imprinted with an acoustic signature particular to each listener 
before it reaches the ear drums. Differences in head and body dimensions 
contribute to audible acoustic changes that the brain learns to associate with 
the direction of the source. Thus, if a listener is presented with the binaural 
signal recorded at the ear canals of another person, localization errors generally 
increase (especially in the median plane) and realism and immersion will be 
somewhat compromised. 

While there are still many unsolved problems and difficult challenges, prac- 
tical VSS systems capable of rendering convincing spatial sound are already a 
reality. In developed countries, an increasingly larger segment of the popula- 
tion experiences virtual spatial sound on a regular basis. Many companies offer 
very convincing virtual sound rendering products for game playing and music 
listening applications. Governments in these countries also continue funding 
research and developing advanced VSS systems for aerospace and military 
purposes. 

The field of binaural audio technology is vast and it would be very ambitious 
to attempt to cover every aspect of it in this chapter. Our goal is to outline 
the basic concepts related to this technology, and provide references relevant 
to each topic that we discuss. In Section 2 we give a brief introduction to the 
basics of spatial hearing. Next, in Section 3 we discuss the acoustics of spatial 
sound. Finally we discuss the design of VSS systems. Before we begin our 
discussion it is convenient to establish the scope of the applications that we will 
be considering in this chapter.

Figure 13.1 Application of a VSS system in a tele-collaboration environment.



1.1 SCOPE 

The techniques that we describe constitute one particular way of rendering 
realistic soundstages in the tele-collaboration environment. Their applicability 
and effectiveness will depend on the particular application scenario and their 
implementation constraints. In a tele-collaboration environment, the goal of a 
VSS system is to provide the listener with a realistic immersion into a virtual 
environment shared with the other conference participants. Depending on the 
degree of realism with which this is achieved, the results will vary from in- 
creased speech intelligibility due to the virtual spatial separation of the various 
participants (i.e. reduction of the cocktail party effect [21]), to complete immer- 
sion. A general application scenario is depicted in Fig. 13.1. Some applications 
that correspond to this general scenario are for example: 

■ corporate meetings, 

■ business transactions, 

■ interactive multiplayer games, 

■ remote education, 

■ live interactive music recording/performance, 

■ virtual gatherings. 

These applications have very different requirements with respect to
soundstage rendering. In some instances (such as multiplayer games), headphone
rendering is possible and appropriate, while in others (virtual gatherings) it is
inconvenient or unnatural. Another design variable is the noise level in
the acoustic environment, which in some cases is low (e.g. corporate meetings)
and in others may be higher. The acoustics of the room (e.g. office,
living room, studio, etc.) will affect the experience in free-field listening
conditions (e.g. loudspeaker rendering). Implementation issues such as latency
and computational complexity are also limiting factors in some cases (e.g. live
interactive performance, games, etc.).

For the rest of the chapter we assume that some mechanism is in charge of 
encoding, transmitting and decoding the sound sources and information about 
their spatial distribution. The VSS system is either fed with the individual 
source signals and (optionally) information about their spatial distribution, or 
with a binaural signal where all sources are already encoded into their respective 
positions. In each instance, the design of the VSS system will face different 
challenges as we describe in Section 4. 

2. SPATIAL HEARING 

Humans and many other animal species have the remarkable ability of iden- 
tifying (with various degrees of accuracy) the direction of a sound source orig- 
inating from any point in three-dimensional space. It has been proposed that 
this ability evolved as a survival mechanism that allowed ancient organisms to 
aurally localize predators and/or prey. This mechanism relies on the availabil- 
ity of multiple sensory inputs, which in the case of humans consist of the two 
acoustic signals reaching the ears. The placement of the ears on the horizontal 
plane maximizes lateral differences for sound sources around the listener [12]. 

The human auditory system is very sophisticated and thus capable of analyz- 
ing and extracting most spatial information pertaining to a sound source from 
these two signals. However, the process of localizing a sound source is dynamic 
and often aided and complemented with other sensory inputs (e.g. visual and/or 
vestibular). As we will describe later in this chapter, the cross-modal nature
of the sound localization process contributes to making synthesis of realistic 
virtual sound sources a particularly difficult engineering challenge. 

We next describe some of the basic mechanisms of sound localization. Our 
goal is to familiarize the reader with the terminology and the basics of spatial 
hearing. For a more detailed and in-depth treatment of these topics the reader 
is referred to the literature provided. 

2.1 INTERAURAL COORDINATE SYSTEM 

We start our discussion by placing a hypothetical listener in the center of the
sound field. We define an interaural-polar coordinate system whose origin is at
the center of the head (see Fig. 13.2) and a rotation axis defined by the positions
of the two ear canals [23]. Here, the lateral angle $\theta$ is measured between a ray
to the source and the median plane, and the polar angle $\phi$ is measured between
the projection of that ray onto the median plane and the horizontal plane. The
ear closest to the source is called ipsilateral and the opposite ear is called
contralateral. In this system, $-90^\circ \le \theta \le 90^\circ$, where $\theta = -90^\circ$
corresponds to the left side of the listener, and $\theta = 90^\circ$ to the right side.
The range of the polar angle is $-90^\circ \le \phi < 270^\circ$, where $\phi = -90^\circ$ is
below the listener, $\phi = 0^\circ$ is in front, $\phi = 90^\circ$ is directly above the
listener’s head, and $\phi = 180^\circ$ is in back.

Figure 13.2 Interaural coordinate system.
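
The definition can be made concrete with a short conversion from Cartesian coordinates. The axis convention used here (x towards the right ear along the interaural axis, y towards the front, z upwards) and the function name are assumptions chosen for the example; the text only fixes the meaning of the two angles.

```python
import math


def interaural_polar(x, y, z):
    """Convert a source position to interaural-polar coordinates.

    Assumed axes: x points to the listener's right (interaural axis),
    y to the front, z upwards; the origin is the center of the head.
    Returns (lateral angle, polar angle, range) with angles in degrees,
    the lateral angle in [-90, 90] and the polar angle in [-90, 270).
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("direction is undefined at the origin")
    # Angle between the ray to the source and the median plane (x = 0).
    lateral = math.degrees(math.asin(max(-1.0, min(1.0, x / r))))
    # Angle of the projection onto the median plane, measured from the front.
    polar = math.degrees(math.atan2(z, y))
    if polar < -90.0:
        polar += 360.0
    return lateral, polar, r
```

A source straight ahead maps to a lateral and polar angle of zero, and a source directly behind the listener maps to a lateral angle of zero and a polar angle of 180 degrees.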

Notice that different coordinate systems are used in the various studies found 
in the literature. One common system is the vertical polar system where the 
axis of rotation is defined in the up/down direction with respect to the subject. 
The choice of the interaural system for this discussion will become clear when 
we describe acoustic measurements in Section 4.2. 

2.2 INTERAURAL DIFFERENCES 

One of the basic binaural processing mechanisms involves the comparison 
between the time of arrival of the sound to the left and right ears. This process 
takes place in part of the auditory pathway from the cochlea to the auditory 
cortex known as the superior olive and can be modeled by cross-correlation be- 
tween channels, followed by detection ([13], p. 144). The lag corresponding to 
the maximum peak of the cross-correlation function determines the perceived 
lateral angle of the sound source. This difference is commonly known as inter- 
aural time difference (ITD). 
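
A minimal broadband version of this cross-correlation model is sketched below. It is a simplification: the auditory system works on subbands, whereas the example correlates the full-band ear signals. The function name, the sign convention (positive when the right-ear signal lags), and the search limit `max_itd` are assumptions.

```python
import numpy as np


def estimate_itd(left, right, fs, max_itd=1e-3):
    """Estimate the ITD as the lag of the cross-correlation maximum.

    left, right : ear signals sampled at fs (Hz).
    max_itd     : largest lag considered, in seconds (assumed limit).
    Returns the ITD in seconds; a positive value means that the right-ear
    signal lags the left-ear signal, i.e. the source is on the left.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    xcorr = np.correlate(right, left, mode="full")
    lags = np.arange(-(len(left) - 1), len(right))
    keep = np.abs(lags) <= int(round(max_itd * fs))
    best_lag = lags[keep][np.argmax(xcorr[keep])]
    return best_lag / fs
```

The resolution of this estimate is one sample period; finer resolution would require interpolating the correlation peak.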

If we assume that the average distance between human ears is about 18 cm
[22], the ITD has a maximum value of about ±0.75 ms. Notice that the ITD
does not uniquely determine the direction of a sound source, since there is an
ambiguity with respect to the front and back hemispheres (see Fig. 13.3). This
ambiguity contributes to what is known as front-back confusion or front-back
reversal.

Figure 13.3 Interaural time difference magnitude for a spherical head of radius a = 9 cm on
the horizontal plane. The ITD function has opposite signs with respect to the median plane.

The presence of the head in the sound field introduces a frequency depen- 
dency on the ITD. This dependency is accounted for by the subband analysis 
carried out in the auditory system. Several researchers have measured and mod- 
eled this frequency dependency [34, 17]. This dependency on frequency shows 
mainly at small lateral angles, where the ITD at lower frequencies (f < 1.5 kHz)
is 50% larger than at higher frequencies. This discrepancy is not obvious and it 
is believed to be introduced by creeping waves at higher frequencies. See [35] 
for more details of the physics of these phenomena. 
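
The factor of 50% can be made plausible with the asymptotic expressions that are commonly quoted for a rigid sphere of radius $a$ (usually attributed to Kuhn's analysis). They are not stated in this chapter and are given here only as a hedged worked example, with $c$ the speed of sound and $\theta$ the lateral angle:

```latex
\mathrm{ITD}_{\mathrm{low}}(\theta) \approx \frac{3a}{c}\,\sin\theta ,
\qquad
\mathrm{ITD}_{\mathrm{high}}(\theta) \approx \frac{2a}{c}\,\sin\theta ,
\qquad
\frac{\mathrm{ITD}_{\mathrm{low}}}{\mathrm{ITD}_{\mathrm{high}}} \approx \frac{3}{2} .
```

Assuming $a = 9$ cm and $c = 343$ m/s, $a/c \approx 0.26$ ms, so the low-frequency coefficient $3a/c$ is about 0.79 ms and the high-frequency coefficient $2a/c$ is about 0.52 ms.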

Another consequence of the presence of the head is that higher frequencies
are attenuated or shadowed by the head as they reach the contralateral ear. This
attenuation produces an interaural-level difference (ILD), which also plays a
major role in lateral localization, especially at higher frequencies. Not only
does the contralateral signal suffer from low-pass filtering and attenuation,
but the ipsilateral signal is also boosted above a certain cut-off frequency
(i.e. a high-shelf response) as the source moves laterally towards the
interaural axis. If we assume that, to a first degree of approximation, the
human head is spherical, we can compute and plot the head shadow as a function
of incidence angle, as shown in Fig. 13.4 [24]. Notice the 6 dB boost for normal
incidence. The ILD is computed by taking the ratio of the two shadow functions
whose angles correspond to the angles of the source to each ear. Notice that
the range of the head shadow will cause the ILD to vary within a ±25 dB range.

Figure 13.4 Magnitude response of a spherical head of radius a = 9 cm for various angles of
incidence. The response of only one ear at $\psi_0 = 0$ is shown.

The ITD and ILD are considered to be the primary cues for the perceived
lateral angle of a sound source, as proposed by Rayleigh in what is known as the
Duplex theory [47]. For a given lateral angle, changes in the polar angle of a
source cannot be predicted by the Duplex theory or obtained from a
spherical-head model. For this reason, the conical surface described at a constant
lateral angle for all possible polar angles is often called the cone of confusion. In
principle, knowledge of the ITD and ILD would allow one to estimate the lateral
angle, and hence to constrain the location of the source to a particular cone of
confusion. Localization in elevation is well developed in humans, but involves
other auditory cues as described next.

2.3 SPECTRAL CUES 

In the median plane (i.e. $\theta = 0^\circ$), the bilateral symmetry of the body implies
that both the ITD and the ILD must vanish. However, humans are still able to
localize sound in the median plane by what are known as monaural cues, which
are related to the spectral changes introduced by the outer ears (i.e. pinnae)
at higher frequencies [48, 49, 7, 56] and other body structures like the torso
at lower frequencies [8, 4]. With broadband sources, a hypothesis for sound
elevation localization is based on the notion that the listener is familiar with
the timbre of the source, and decoding the polar angle involves comparing the
memorized spectrum with the perceived spectrum. The spectral shape of the
difference will determine the perceived polar angle. For narrowband sources,
the perceived polar angle depends mainly on the frequency of the stimulus [13]
(p. 45) and [18]. Effective localization of unfamiliar sources in the median
plane can only be achieved with head motion.

The symmetry of the spherical model introduced in the previous section
suggests that the monaural and ILD spectra should be independent of polar
angle, which is contrary to fact in humans. Thus, it is likely that this dependence
of spectra on polar angle can provide binaural as well as monaural cues to
elevation. In the same way, the spherical-head model explains the head shadow
part of the dependence of the ILD spectrum on lateral angle, but monaural and
ILD spectra are rather complicated functions of lateral angle as well [23, 39].
If this is the case, then is it possible to use monaural cues to localize in lateral 
angle? Some studies have shown that monaural cues help monaural listeners 
(i.e. listeners with complete hearing loss in one ear) to localize the lateral 
direction of a source with relatively high accuracy. However, this was not the 
case for fully-binaural subjects with a blocked ear [39]. 

Spectral cues are also used to discriminate the front from the back when the
sound source has sufficient high-frequency energy (f > 3 kHz). These cues are
believed to be introduced by the front/back asymmetry of the pinna, which results
in a pinna shadow for sound sources arriving from the back. In the absence of
this cue, head rotation is necessary to resolve front/back ambiguity [53].

2.4 DISTANCE CUES 

Monaural spectra, ITD and ILD vary with lateral angle, polar angle, and 
distance. The range dependence is significant when the source is very close, 
but is relatively unimportant at greater distances. For distances of less than 1 m, 
the ILD increases significantly for lateral sources, but the ITD and spectral cues 
are similar to those of distant sources. With this evidence some authors have 
proposed that binaural cues play an important role for nearby sources [15, 16]. 

At large distances interaural differences and spectral cues are not reliable 
cues to estimate the distance of a source. One of the most useful cues for 
range estimation is loudness. It is well known that the loudness (and to a lesser 
degree, the spectra) of a sound source changes with distance. As with
median-plane localization, the effectiveness of this cue depends on the familiarity of
the listener with the source. For unfamiliar sound sources and distances larger 
than 3 m the perceived distance of a source is proportional to its loudness and 
not to the true distance [41, 12]. 

Other, more salient cues for distance perception are derived from the
acoustic environment. Reverberation and/or reflections from nearby surfaces
play a major role in distance perception. The ratio of direct to reverberant
(or reflected) sound (D/R ratio) is a function of the relative distance between
source and listener and of the room acoustics. This cue can be used more
reliably by listeners even if they have no familiarity with the particular
sound source [38]. As we discuss later in Section 4.3.1, in headphone listening
conditions these cues play a major role for externalization.
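
One simple way to quantify this cue from a measured room impulse response is sketched below, taking the D/R ratio as direct energy over reverberant (or reflected) energy. The split between direct and reverberant parts at a fixed window after the strongest peak, the window length of 2.5 ms, and the function name are assumptions for the example; other definitions are in use.

```python
import numpy as np


def direct_to_reverberant_db(rir, fs, direct_window_ms=2.5):
    """Direct-to-reverberant energy ratio of a room impulse response in dB.

    rir              : impulse response samples (NumPy array).
    fs               : sampling rate in Hz.
    direct_window_ms : time after the main peak that is still counted as
                       direct sound (assumed value).
    """
    peak = int(np.argmax(np.abs(rir)))
    split = peak + int(round(direct_window_ms * 1e-3 * fs)) + 1
    direct_energy = np.sum(rir[:split] ** 2)
    reverberant_energy = np.sum(rir[split:] ** 2)
    if reverberant_energy == 0.0:
        return float("inf")  # anechoic response: no reverberant part
    return 10.0 * np.log10(direct_energy / reverberant_energy)
```

As the source moves away from the listener in a given room, the direct energy drops while the reverberant energy stays roughly constant, so the value returned by this function decreases.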

2.5 DYNAMIC CUES 

Sound localization is not restricted to static sources or listeners. Localiza- 
tion in everyday life includes head and body motion. As the head moves, the 
spatial cues will change according to the nature of the motion, and this will 
have a net effect on localization. Theories that study this phenomenon are often 
called motional theories [13]. Dynamic cues are extremely useful to resolve 
the ambiguities that static cues cannot handle. Many studies have shown that 
when subjects are allowed to move the head, localization blur and front/back 
reversals are significantly reduced [51]. Localization is frequently reinforced 
by other sensory inputs, such as visual and vestibular, which also have dynamic 
properties. Experiments have shown that listeners evaluate interaural differ- 
ences at the same time as they move their head in relation to the direction of the 
source. All cues need to be consistent to produce the correct perception, and 
vestibular and visual cues carry information [53]. 

Unfortunately, most of the research in spatial hearing has focused on the 
static case and there are still many unanswered questions regarding the dy- 
namic behavior of the spatial hearing system. In a VSS system, the adjustment 
or correction of the relative position between the listener and the virtual source
as either moves needs to be accurate and seamless to provide the same experience
that the listener would have in real life. As we discuss later (Section 4),
this is achieved by monitoring the listener’s motion and adjusting the system 
parameters accordingly. 

3. ACOUSTICS OF SPATIAL SOUND 

All acoustic cues used by the hearing mechanism for decoding spatial information
are encoded in the binaural signal. Understanding the physical phenomena
that give rise to these cues is essential in the design of modern VSS systems. In
this section, we describe how spatial cues are acoustically encoded in the binau- 
ral signal. The effect of the body on the signal is specified by the head-related 
transfer function. 

3.1 THE HRTF 

In an anechoic environment, as sound propagates from the source to the 
listener, the different structures of the listener’s own body will introduce changes 
to the sound before it reaches the ear drums. The effects of the listener’s body
are captured by the head-related transfer function (HRTF), the transfer function
between the sound pressure that is present at the center of the listener’s head 
when the listener is absent and the sound pressure developed at the listener’s 
ear. The HRTF is a function of direction, distance, and frequency. The inverse 
Fourier transform of the HRTF is the head-related impulse response (HRIR), 
which is a function of direction, distance, and time. 

In the time domain, the ITD is encoded in the HRIR as differences in the 
time of arrival of the sound between ipsilateral and contralateral sides. This
can be observed in Fig. 13.5 where we show the HRIR amplitude and HRTF
magnitude as functions of polar angle for various lateral angles. Close to the
median plane (θ = 10°), the time of arrival of the wavefront is similar for
both ears. However, as the lateral angle increases the time of arrival to the 
contralateral ear progressively exceeds that of the ipsilateral, thus increasing 
the ITD. The ILD is encoded as the level differences observed in the HRTF 
magnitude responses. Notice how the level difference is small near the median 
plane, and increases with lateral angle. 
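
The way ITD and ILD are encoded in an HRIR pair can be sketched with a simple estimator: the ITD is taken as the lag that maximizes the cross-correlation between the two ears, and the ILD as the broadband energy ratio in dB. This is one common estimator among several, not necessarily the one used for the figures; the function name, the sign convention, and the toy responses are assumptions for illustration.

```python
import math

def estimate_itd_ild(hrir_left, hrir_right, fs):
    """Return (itd_seconds, ild_db) for an equal-length HRIR pair.

    ITD: lag that maximizes the cross-correlation; positive means the
    right ear lags the left. ILD: left/right energy ratio in dB.
    """
    n = len(hrir_left)
    best_lag, best_val = 0, -math.inf
    for lag in range(-(n - 1), n):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += hrir_left[i] * hrir_right[j]
        if acc > best_val:
            best_lag, best_val = lag, acc
    e_left = sum(x * x for x in hrir_left)
    e_right = sum(x * x for x in hrir_right)
    return best_lag / fs, 10.0 * math.log10(e_left / e_right)

# Toy pair: the right-ear response is the left one delayed by 24 samples
# and attenuated by a factor of two (source on the listener's left).
fs = 48000
left = [0.0] * 64
left[2], left[3] = 1.0, 0.3
right = [0.0] * 64
right[26], right[27] = 0.5, 0.15

itd, ild = estimate_itd_ild(left, right, fs)
print(f"ITD = {itd * 1e6:.0f} us, ILD = {ild:.2f} dB")
```

For measured HRIRs the ILD is strongly frequency dependent, so a per-band energy ratio would be more informative than this single broadband figure.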

In the median plane, both ITD and ILD are very small, but there are strong 
spectral variations (i.e. monaural cues) that change with polar angle as shown 
in Fig. 13.6. Here we show the HRIR amplitude and HRTF magnitude for a 
human subject in the median plane. (The left and right ears are almost identical
due to the symmetry of the subject; thus only one side is shown.) In the time
domain (i.e. HRIR) one can see the following features. The time of arrival of
the wavefront creates the main pulse, which seems to arrive at the same time
for all polar angles. Closer examination reveals that for directions above the
subject, the main pulse is broadened due to diffraction around the top of the
head. Following the main pulse, a pinna echo or reflection appears as the
second brightest feature in the HRIR. Notice that the time lag of the reflection
changes with polar angle. The perceptual effect of this echo is apparent if
we examine the HRTF magnitude, where it manifests as pinna notches (comb
filtering). These notches are the most prominent elevation-dependent features in
the HRTF at higher frequencies. In the midrange, the most prominent feature is
a broad resonance near 3 kHz, which has more energy in frontal directions.
Wavelengths corresponding to this resonance are on the order of the size of the
pinna, which acts as a parabolic energy collector in front and casts a shadow
for sounds coming from the back (i.e. pinna shadow). This resonance can be
observed in the time domain as ripples in the HRIR. While analysis of the HRIR 
reveals the acoustic origin of some of the spectral features, the importance of 
the detailed time structure seems to be minor as shown by some localization 
experiments where the phase spectrum of the HRTF was altered [33]. 
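
The link between a pinna echo in the HRIR and notches in the HRTF magnitude can be illustrated with the simplest possible comb filter, a direct pulse plus one delayed, attenuated copy. This is a minimal sketch: the single-echo model, the 60 microsecond lag, and the reflection gain are illustrative assumptions, not values taken from the measurements discussed above.

```python
import cmath
import math

def echo_magnitude_db(f_hz, delay_s, gain):
    """Magnitude (dB) of H(f) = 1 + gain * exp(-j 2 pi f delay)."""
    h = 1.0 + gain * cmath.exp(-2j * math.pi * f_hz * delay_s)
    return 20.0 * math.log10(abs(h))

def notch_frequencies(delay_s, f_max_hz):
    """Notch centers (2k + 1) / (2 * delay) below f_max_hz, for gain > 0."""
    notches = []
    k = 0
    while (2 * k + 1) / (2.0 * delay_s) < f_max_hz:
        notches.append((2 * k + 1) / (2.0 * delay_s))
        k += 1
    return notches

delay = 60e-6  # hypothetical echo lag (seconds)
gain = 0.7     # hypothetical reflection strength

for f in notch_frequencies(delay, 20000.0):
    depth = echo_magnitude_db(f, delay, gain)
    print(f"notch near {f / 1000:.2f} kHz, {depth:.1f} dB")
```

Because the notch frequency is inversely proportional to the echo lag, a lag that changes with polar angle moves the notch in frequency, which is consistent with the elevation dependence described in the text.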

At lower frequencies, where wavelengths are comparable to the head and 
body size, the main elevation-dependent feature is a series of notches whose 
center frequency is lowest above the subject and highest towards the floor. This
pattern is consistent with reflections arriving from the shoulders and torso, as
can be seen in the HRIR as faint reflections at larger time lags. These features
were until recently believed to have little importance. However, studies have
suggested that cues derived from torso diffraction are effective for localizing
low-frequency sources away from the median plane [8, 4].

Figure 13.5 The HRIR (left column) and HRTF magnitude (right column) for a human subject
(subject 3 in the CIPIC database [3]). The abscissa in each panel is polar angle, the ordinate is
time or frequency and the grayscale is amplitude or magnitude in dB.

For other lateral angles, the HRIR and HRTF show similar features. How- 
ever, the differences between ipsilateral and contralateral sides increase, as can 
be seen in Fig. 13.5. While the HRTFs of most humans share these similarities, 
more detailed examination reveals subtle differences determined mainly by dif- 
ferences in body shape and size among subjects [45]. These subject-dependent 
differences have been shown to play a major role for precise localization. It
is believed that only using one’s own HRTF can result in realistic and accurate
binaural audio, as evidenced by various experiments [45]. Some studies
have shown that some subjects can localize better with someone else’s HRTF.
However, these cases are rare exceptions [55].

Figure 13.6 The (a) HRIR and (b) HRTF magnitude in the median plane for a human subject.
The abscissa is polar angle, the ordinate is time or frequency and the grayscale is amplitude in
(a) and magnitude in dB in (b).

3.2 ROOM ACOUSTICS

The acoustic modifications introduced by the HRTF are not the only com- 
ponents of the total spatial-hearing experience. Generally, listeners experience 
sound in rooms or acoustic environments that introduce additional distortions. 
These distortions can be either pleasing or annoying. For example, concert halls 
are carefully designed to create pleasing modifications for both musicians and 
audience. Everyday life environments are less carefully designed and can in- 
troduce undesirable modifications that in extreme cases can even impair speech 
communication. 

Information about this acoustic environment can be drawn from analysis of 
its impulse response if one looks at it as a system. For a typical room, the 
impulse response shows the following three components: 

1. Direct path: unmodified (or scaled) sound that reaches the ear via a direct 
linear path. 

2. Early reflections: reflections from nearby surfaces such as the ceiling, the
floor and the walls (within the first 100 ms). 

3. Late reverberation: energy reaching the ear after multiple reflections 
from all surfaces. 

The auditory system employs the acoustic information from the environment 
to determine, for example, the distance of the sound source (see Section 2.4). 
The direct path establishes the energy of the source, which is then used to esti- 
mate the D/R ratio from the total energy (see Section 2.4). The early reflections 
introduce timbre changes and echoes that give tonal quality to the room. The 
late reverberation increases the sense of spaciousness and is also used to estimate
distance. Overall, the room acoustics contribute to the externalization and
the realism of the sound source. 
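
The three components of a room impulse response, and the D/R ratio derived from them, can be sketched as follows. The 100 ms boundary for early reflections comes from the list above; the 2.5 ms window used to isolate the direct path, the function names, and the toy response are assumptions for illustration only.

```python
import math

def split_energy(rir, fs, direct_ms=2.5, early_ms=100.0):
    """Return (direct, early, late) energies of a room impulse response.

    Times are measured from the strongest sample, which is taken to be
    the arrival of the direct path.
    """
    peak = max(range(len(rir)), key=lambda i: abs(rir[i]))
    n_direct = peak + int(round(direct_ms * 1e-3 * fs))
    n_early = peak + int(round(early_ms * 1e-3 * fs))

    def energy(segment):
        return sum(x * x for x in segment)

    return (energy(rir[:n_direct]),
            energy(rir[n_direct:n_early]),
            energy(rir[n_early:]))

def direct_to_reverberant_db(rir, fs):
    """D/R ratio in dB: direct energy over early plus late energy."""
    direct, early, late = split_energy(rir, fs)
    if early + late == 0.0:
        raise ValueError("no reflected energy (anechoic response)")
    return 10.0 * math.log10(direct / (early + late))

# Toy response at 8 kHz: direct pulse, two early reflections, one late tail tap.
fs = 8000
rir = [0.0] * 2000
rir[10], rir[100], rir[400], rir[1200] = 1.0, 0.5, 0.3, 0.1

print(f"D/R = {direct_to_reverberant_db(rir, fs):.2f} dB")
```

Moving the source away from the listener lowers the direct energy while the reverberant energy stays roughly constant, so the D/R ratio falls with distance.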

While a large fraction of the research on sound localization has been con- 
ducted in anechoic conditions, many studies have focused on sound localization 
inside rooms [29]. Under non-anechoic conditions there are psychoacoustic 
phenomena, such as the precedence effect (the perceived direction of a sound 
source corresponds to the direction of the first wavefront reaching the ears), 
that greatly affect sound localization. In general, the effect of the room is to 
increase the localization blur (i.e. uncertainty about the exact position of the
source). Since correlated sound is arriving from several different directions, 
the auditory system receives conflicting cues that it somehow is able to resolve 
into a single direction estimate. However, this estimate has large variance com- 
pared to the estimate in anechoic conditions. Unless the listener is very close 
to a reflecting surface, localization in rooms is surprisingly accurate. Recent 
studies have shown that temporal integration and adaptation in higher auditory
processes might be used to reduce blur and improve the estimate [50]. A VSS
system needs to introduce or simulate room acoustic cues to achieve immersion
and realism. This process, known as auralization, will be described in the next
section where we describe the design of a VSS system. 

4. VIRTUAL SPATIAL SOUND SYSTEMS 

While the concept of using the binaural signal to synthesize a realistic spatial 
experience is simple, the practical implications are not. In this section we 
discuss some issues in the design of practical VSS systems based on binaural 
signal processing. 

4.1 HRTF MEASUREMENT 

A key problem that arises in the design of a VSS system is the measurement 
of the HRTF. This is a long and expensive process, where human subjects (and 
sometimes mannequins) are equipped with microphones near or inside their ear 
canals to record the acoustic modifications introduced by the subject’s anatomy. 
Generally the subject is in the center of some kind of apparatus (e.g. a rotating 
hoop), where fixed and/or movable loudspeakers placed in specified directions 
play the test signals. The recording is post-processed with knowledge about 
the test signal and the measurement system to compute the HRTF. 

For such a measurement, it is quite desirable not to have the source location- 
dependent results be critically sensitive to the position of the point within (or 
outside) the ear canal where the HRTF is measured. Measurements near the 
eardrum will account for all individual localization characteristics of the listener 
including the ear canal resonance. However, such measurements are somewhat 
intrusive (and dangerous) and may not be necessary. It has been reported by
many authors [43, 40, 37, 27] that all localization information can be obtained
at a number of points within the ear canal (and possibly a few millimeters
outside). Binaural signals at the eardrum can be synthesized from HRTF measurements
with the ear canal blocked (so-called blocked-meatus measurements)
along the ear canal if:

■ The transfer function can be separated into its location-dependent and 
location-independent parts, where the location-dependent part can be 
measured with a blocked meatus. 

■ Blocked ear canal HRTF measurements with proper headphone compen- 
sation can reproduce the correct binaural signals at the ear drum. 

In practice, acoustic measurement of the HRTF requires meticulous preparation
and careful selection and/or design of the measurement equipment. It is
not within the scope of this chapter to discuss these issues; instead, we point
interested readers to some relevant literature [5]. Here we provide an example.

An HRTF measurement apparatus consists of the following components:

1. Sound generator. 

2. Ear microphones (mounted at any position within the ear canal). 

3. Loudspeakers (movable or many of them at desired measurement points). 

4. Sound recorder (receives microphone pre-amplifier output).

5. Subject restraint or reference.

6. Signal processor. 

The signal processor generates test signals that are played back and recorded
through the system. The recording is then processed to estimate the system
response using a system-identification technique. Among these techniques, Golay
codes and Maximum Length Sequences (MLS) are widely used [5]. Although
formally the HRIR has infinite duration, only a truncated response is estimated
with these systems. Care is also taken to compensate for the linear distortions
introduced by the loudspeakers and microphones. A reference measurement
without the subject and with the microphones placed at the origin (free-field
measurement) is inverted and applied as compensation. There are many factors
that can reduce the fidelity of the HRTF measurement such as acoustic
noise, electric noise, subject motion, apparatus inaccuracies, etc. All these 
factors need to be assessed and considered in the design of the system and the 
measurement protocols. 
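
The identification step can be sketched with a short maximum length sequence. The example below is an idealized, noise-free simulation, not the procedure of any particular measurement system: it generates a 31-sample MLS from the primitive polynomial x^5 + x^2 + 1, passes one period through a toy "system" by circular convolution, and recovers the impulse response by circular cross-correlation. All function names and the toy response are hypothetical.

```python
def mls5():
    """One period (31 samples) of a +/-1 maximum length sequence.

    Uses the recurrence b[n] = b[n-5] XOR b[n-3], which corresponds to the
    primitive polynomial x^5 + x^2 + 1.
    """
    bits = [1, 0, 0, 0, 0]
    for n in range(5, 31):
        bits.append(bits[n - 5] ^ bits[n - 3])
    return [1.0 if b else -1.0 for b in bits]

def circular_convolve(x, h):
    """Steady-state response of FIR system h to the periodic signal x."""
    n = len(x)
    return [sum(h[m] * x[(k - m) % n] for m in range(len(h))) for k in range(n)]

def identify(recorded, excitation):
    """Recover the impulse response by circular cross-correlation.

    The periodic autocorrelation of an MLS is N at lag 0 and -1 elsewhere,
    so r[k] = (N + 1) * h[k] - sum(h); the offset sum(h) equals sum(r).
    """
    n = len(excitation)
    r = [sum(recorded[i] * excitation[(i - k) % n] for i in range(n))
         for k in range(n)]
    offset = sum(r)
    return [(rk + offset) / (n + 1) for rk in r]

excitation = mls5()
h_true = [0.0, 1.0, 0.5, -0.25]          # toy "HRIR", shorter than one period
recorded = circular_convolve(excitation, h_true)
h_est = identify(recorded, excitation)
print([round(v, 6) for v in h_est[:6]])
```

In a real measurement the sequence is much longer than the response being measured, several periods are averaged to reduce noise, and the result is then corrected with the free-field reference described above.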

The measurement apparatus in our example is shown in Fig. 13.7. In this system
several loudspeakers are mounted on a 1-m-radius hoop that rotates about
the interaural axis (the trajectory described by each loudspeaker corresponds
to one slice of the cone of confusion). HRTF measurement is complicated and
expensive. Fortunately, some institutions and universities around the world
make HRTF measurements available to the public for research purposes; see,
for example, [3].

An alternative or complement to acoustic measurements are numerical solutions
where the HRTF is computed from analysis of the geometry of the
subject’s body. Photographs or three-dimensional laser scans are used to derive
a geometry grid and numerical methods are applied to solve the wave equations
at each point in the grid (e.g. the boundary-element method (BEM) [31]). Proponents
of these techniques argue that, if made practical, this technology could
replace acoustic measurements altogether, with the advantage that noise and
subject motion are minimized. However, the acquisition of the detailed geometry
of the human body and the computational complexity of these methods are
still beyond practical limits.

Figure 13.7 Example of an HRTF measurement system (courtesy of the CIPIC Interface Laboratory,
UC Davis, California).



4.2 HRTF MODELLING 

HRTF measurements are generally made available as a discrete set of finite 
impulse response (FIR) filter coefficients. Depending on the frequency reso- 
lution of the measurement, the length of the FIR filters will vary from a few 
hundred to a thousand samples (at a 48 kHz sampling rate). The number of filters
varies with the spatial resolution of the measurement apparatus. For example,
obtaining a 5° spatial resolution in full 3-D space would require measurement
of over 2,500 binaural responses.
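
The figure of over 2,500 responses can be checked with a quick count. The sketch assumes a regular azimuth/elevation grid, which is only one possible sampling scheme (the hoop apparatus described earlier uses interaural coordinates instead); the function name is hypothetical.

```python
def grid_size(step_deg):
    """Directions on a regular azimuth/elevation grid with the given step.

    Returns (full, merged): 'full' counts every azimuth at every elevation
    from -90 to +90 degrees; 'merged' counts each pole only once.
    """
    if step_deg <= 0 or 180 % step_deg:
        raise ValueError("step must be a positive divisor of 180 degrees")
    n_az = 360 // step_deg
    n_el = 180 // step_deg + 1
    full = n_az * n_el
    merged = n_az * (n_el - 2) + 2
    return full, merged

print(grid_size(5))  # (2664, 2522)
```

Either way of counting gives more than 2,500 directions, each of which requires a left-ear and a right-ear filter.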

A brute-force implementation approach for a VSS system is to use the measurements
directly and perform real-time convolution (i.e. filtering) with the
sound sources. This approach is extremely expensive if one considers the duration
of the HRIR and the high sampling rates required in high-quality audio.
Another problem is related to the finite resolution of the measurement, which 
makes it necessary to interpolate filter coefficients to achieve smooth position 
changes. 
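
A minimal sketch of the brute-force approach follows: direct-form convolution of a mono source with an HRIR pair, a naive linear crossfade between two measured responses, and a rough operation count. The 512-tap length is a hypothetical value inside the range quoted above, and sample-by-sample linear interpolation is only the simplest option; it is known to smear responses whose onset delays differ.

```python
def convolve(x, h):
    """Direct-form FIR filtering: full linear convolution of x with h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y

def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono source with an HRIR pair to obtain (left, right)."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

def interpolate_hrir(h_a, h_b, weight_b):
    """Linear crossfade between two measured HRIRs of equal length."""
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(h_a, h_b)]

def macs_per_second(n_taps, fs, n_sources):
    """Multiply-accumulates per second for direct convolution, two ears."""
    return 2 * n_taps * fs * n_sources

# One source, hypothetical 512-tap HRIRs, 48 kHz sampling rate.
print(macs_per_second(512, 48000, 1))  # 49152000
```

The cost grows linearly with the number of sources, which is why the parametric models discussed next become attractive when many sources must be rendered at once.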

As computational resources keep increasing, this approach would appear 
viable. However, rendering a large number of sources is also expected to be re- 
quired as bandwidth and consumer expectations increase. Other more efficient 
ways of processing and rendering binaural signals are required for implemen- 
tation of practical VSS systems. One common approach towards this goal is 
to model the HRTF or HRIR by a reduced number of parameters and to make
the processing more efficient by operating in this parametric domain. There are
two main approaches for modeling the HRTF, as discussed next. 

4.2.1 Signal Models. In reality, the HRIR is an infinite impulse response
system. However, as we discussed above in Section 4.1, due to practical limitations
only a truncated impulse response can be measured. In some applications,
the length of the measurement will prohibit its implementation by straight con- 
volution. It is then advantageous to simplify these responses, either by reducing 
redundancy or by removing trivial information. Another advantage of model- 
ing is that spatial interpolation is sometimes more conveniently done in the 
parameter space. 

In essence, the signal models look at the HRTF or HRIR as a set of signals 
or filters and attempt to best represent the set with the least number of param- 
eters (sometimes focusing on implementation constraints). Some examples of 
techniques are: 

■ Standard models where the minimum-phase part of the HRTF is used to 
derive filters and a delay line is used to model the frequency-independent 
ITD. 

■ Transform decompositions, such as principal component analysis (PCA) 
[20], spherical harmonics [19], etc. Interpolation of missing measure- 
ments works well with these models. 

■ Rational models of various kinds that attempt to model the data using sys- 
tem identification techniques. One example includes techniques where 
the HRIRs are modeled as IIR pole-zero filters with a set of common
location-independent poles and location-dependent zeros [28]. 

■ Perceptual models that remove perceptually-irrelevant information thus 
reducing the number of parameters required to represent the HRIR. Examples
include warped-frequency filter representations that take into account
the non-uniform frequency resolution of the hearing system [30]. 

While these techniques achieve very good results, it is often the case that 
the models are not well-suited for customization to an individual listener. As 
discussed in Section 3.1, the HRTF depends strongly on the listener’s anatomy, 
and a flexible system capable of adapting to different listeners by a simple 
change of parameters is an appealing tool. One modeling technique that is 
employed to do this is described next. 

4.2.2 Structural Models. The main idea of a structural model is to factor 
the HRTF as a combination (parallel, serial or both) of independent parametric 
signal models. Each model represents a physical object that contributes to
the acoustic modifications in the HRTF. As we have already seen, the sphere
is a good model to describe the effect that the head has on the HRTF. The
basic assumption in these models is an acoustic superposition principle where
the different models do not overlap in space and/or frequency. For example,
the sphere is a good model for the head whose major contribution is limited
to the low-frequency part of the HRTF, but does not describe the behavior at
high frequencies like a pinna model would. Thus, a spherical-head model
might be cascaded with a pinna model to obtain an overall model valid for all
frequencies.

Figure 13.8 Signal flow of a typical structural HRTF model.

A general signal flow diagram of a structural model is shown in Fig. 13.8. Each
submodel consists of simple signal processing operations, such as delays and/or
low-order IIR filter sections. Notice that one of the main advantages of these
models is that the different submodels can receive data about the anthropometry
of the listener and adjust their parameters accordingly. Many researchers have
proposed structural models, for example: 

■ Head models: spherical head for ITD computation [57]; spherical head 
for ITD and head shadow computation [14]; spherical head with dis- 
placed ears for ITD computation [2]; ellipsoid with displaced ears for 
ITD computation [25]. 

■ Torso models: spherical or ellipsoidal head-and-torso (HAT) [8] that uses 
ray-tracing arguments to model shoulder reflections; Snowman model 
(spherical head and torso) [1] that uses ray tracing arguments and models 
torso shadow as well. 

■ Pinnae models: sum of delayed energy to model pinna notches [46]; 
beamforming approaches [10, 20]; physical models of sound diffraction 
and reflection [36]. 

■ Other models: contralateral pinnae models derived from ipsilateral HRTF 
[9]. 
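
To illustrate how a structural submodel can be driven by anthropometry, the sketch below computes the ITD of a spherical head from the head radius using the classical Woodworth formula, ITD = (a/c)(θ + sin θ), for a distant source. This is a textbook spherical-head result chosen for illustration and is not necessarily the exact form used in the works cited above; the default speed of sound and the example head radius are assumed values.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def woodworth_itd(head_radius_m, azimuth_deg):
    """ITD (s) for a rigid sphere and a distant source (Woodworth's formula).

    azimuth_deg is measured from straight ahead and must lie in [0, 90].
    """
    if not 0.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must lie in [0, 90] degrees")
    theta = math.radians(azimuth_deg)
    return head_radius_m / SPEED_OF_SOUND * (theta + math.sin(theta))

def itd_in_samples(head_radius_m, azimuth_deg, fs):
    """ITD rounded to whole samples, e.g. for an integer delay line."""
    return round(woodworth_itd(head_radius_m, azimuth_deg) * fs)

# Customizing the submodel amounts to changing one parameter, the head radius.
for radius in (0.080, 0.0875, 0.095):
    itd = woodworth_itd(radius, 90.0)
    print(f"a = {radius * 100:.2f} cm: ITD = {itd * 1e6:.0f} us")
```

A full structural model would cascade such a delay with head-shadow, torso and pinna submodels, each adjusted from its own anthropometric measurement.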

4.3 VIRTUAL SPATIAL SOUND RENDERING

Once the HRTF or HRTF models have been selected and all sound sources 
have been filtered and combined into a binaural signal, the next step is to deliver 
this binaural sound to the listener. For the binaural technique to work, playback 
conditions need to match recording conditions. Two ways of achieving this are 
described next. 

4.3.1 Headphone Rendering. The simplest way of delivering a binaural 
signal is through headphones. The reason is that each playback channel has 
to be delivered to the ears without contamination (i.e. crosstalk) to match the 
conditions under which the recording was made. However, the headphone itself 
will introduce linear distortions of its own and the sound reaching the ear drum 
will not necessarily be the sound that would have reached it under free-field 
listening. A compensation that accounts for the response and loading of the 
acoustic transducer and ear cup has to be applied in the case of headphones. 
This compensation or headphone equalization function will depend on many 
factors, including the measurement point and type of headphones, as well as 
the listener’s pinnae [44].

An acoustic circuit model such as that described in [42] can be used to com- 
pute the necessary headphone equalization functions for various measurement 
conditions. If the HRTF is measured at any position with an open ear canal 
(including close to the eardrum) the headphone-to-measurement transmission
has to be compensated for. For blocked ear canal measurements, both the 
headphone-to-open and the headphone-to-blocked transmission at the position 
of measurement may be needed to devise the correct compensation. Some stud- 
ies suggest that the scheme for binaural synthesis through headphones based on 
blocked meatus HRTF measurements and the sound transmission model in [42] 
is adequate to convey most localization cues available in free field hearing [5]. 
For HRTFs measured very near to the eardrum [55], headphone rendering necessitates
one more compensation curve due to the ear canal resonance (which
does not convey spatial information). Without compensation, the ear canal resonance
will be excited twice and will introduce unwanted timbre differences
that might even result in a loss of realism.

Headphone listening is a very unnatural experience. In an informal observation,
a two-year-old toddler who was experiencing music playback over
headphones for the first time was extremely disturbed when turning her head 
did not result in the expected motion of the sound sources she was perceiving. 
After repeated head rotations, the toddler decided to leave the headphones alone 
and asked that her tape be played over the loudspeakers, arguing that she did 
not like how the sound followed her when she was wearing the headphones.

To achieve realism, binaural signal rendering through headphones requires
that the spatial cues for each source be changed according to its relative position
with respect to the listener. Otherwise, the source will appear to follow the
listener when he or she rotates or moves the head. This compensation is possible
only if some knowledge about the absolute position and head rotation angle of
the subject is available. This is generally achieved by a head tracker, which
is basically a transducer that monitors the position and rotation angles of the 
head with respect to a known reference. Head tracker technology includes 
electromagnetic sensors, gyroscopes, optical sensors, cameras, or even sound 
[52]. The requirements of the head tracker will depend on the application. In 
general, the tracker must have low latency (below 96 ms [54]) and be accurate 
to within a couple of degrees. Head trackers are in general expensive and 
inconvenient to wear, thus their use in consumer applications is very low. 
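
The core of head-motion compensation can be sketched for rotation about the vertical axis only: the tracker reports the head yaw, and the renderer selects the HRTF for the source direction relative to the head. The sign convention (azimuth and yaw both measured in the same room-fixed sense), the 5° measurement grid, and the function names are assumptions for illustration.

```python
def wrap_degrees(angle):
    """Wrap an angle to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def relative_azimuth(source_az_deg, head_yaw_deg):
    """Source azimuth relative to the head; both inputs in room coordinates."""
    return wrap_degrees(source_az_deg - head_yaw_deg)

def nearest_measurement(az_deg, grid_step_deg=5.0):
    """Azimuth of the nearest HRTF on a hypothetical uniform grid."""
    return wrap_degrees(round(az_deg / grid_step_deg) * grid_step_deg)

# A source fixed at 30 degrees in the room while the listener turns the head.
for yaw in (0.0, 12.0, 30.0, 90.0):
    rel = relative_azimuth(30.0, yaw)
    print(f"yaw {yaw:5.1f} -> relative {rel:6.1f} -> HRTF at {nearest_measurement(rel):6.1f}")
```

When the listener turns to face the source the relative azimuth goes to zero, so the virtual source stays fixed in the room instead of following the head. A complete system would also handle pitch, roll and translation, and would interpolate between measurements rather than snapping to the nearest one.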

For the VSS system to be able to compensate for head motion, the different 
sound sources and their respective spatial locations need to be available. If only 
a binaural signal is available, then the task becomes much more difficult since 
it would require source separation to recover the individual sources and their
corresponding locations.

Another requirement for realistic rendering is to add an acoustic environment 
to the binaural signal. This process, also known as auralization, can be either
a simple addition of reverberation (real or artificial) or a very complex room 
model. The main advantage of auralizing the binaural signal is that the sound 
sources are externalized. HRTF measurements are in general anechoic, and 
synthesis of a binaural signal will result in a dry and unnatural experience. 
While it is believed that realistic externalization is only possible with good 
room models and accurate motion compensation, some research has shown 
that under engineering constraints, a canonical environment can be designed 
to improve localization and reduce front-back reversals [6]. Auralization is a 
vast topic and cannot be covered with detail in this chapter, see [12] for more 
details. 

4.3.2 Crosstalk-Cancellation Rendering. Another approach to render- 
ing binaural signals that does not require headphones but instead uses a stereo 
loudspeaker setup is the so-called crosstalk canceller. These systems are based 
on a simple and elegant concept first introduced in the early 1960s by Schroeder
and Atal [26] and Bauer [11]. The main idea is that in a stereo setup the acoustic
crosstalk paths from left speaker to right ear $h_{LR}(t)$ and right speaker to
left ear $h_{RL}(t)$ (see Fig. 13.9) that naturally occur in free field conditions are
measured and/or modeled, and used to compute a preprocessing filtering network.
This network creates two signals per channel that upon reaching the ears
cancel and reinforce each other in a way that only the desired binaural signals
at each ear prevail. Mathematically, the filters in the cancelling network $g_{ij}(t)$
are derived to satisfy the following set of equations:

\begin{align}
h_{LL}(t) * g_{LL}(t) + h_{RL}(t) * g_{RL}(t) &= \delta(t), \tag{13.1} \\
h_{LR}(t) * g_{LR}(t) + h_{RR}(t) * g_{RR}(t) &= \delta(t), \tag{13.2} \\
h_{LL}(t) * g_{LR}(t) + h_{RL}(t) * g_{RR}(t) &= 0, \tag{13.3} \\
h_{LR}(t) * g_{LL}(t) + h_{RR}(t) * g_{RL}(t) &= 0, \tag{13.4}
\end{align}

where $*$ denotes convolution and $\delta(t)$ is a unit impulse. Generally, this system
is solved in the frequency domain where convolution turns into multiplication.

Figure 13.9 Acoustic paths in crosstalk cancellation.
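
In the frequency domain, equations (13.1) to (13.4) reduce to a 2x2 matrix inversion at each frequency bin. The sketch below assumes that the first index of each path denotes the loudspeaker and the second the ear, and that the first index of each network filter denotes the loudspeaker it feeds and the second the input channel, which is the reading consistent with the four equations. It uses a plain DFT and toy responses, and it ignores causality, stability and time aliasing of the resulting filters; the function names and the singularity threshold are hypothetical.

```python
import cmath
import math

def dft(x, n_bins):
    """Plain DFT of x, zero-padded to n_bins points."""
    return [sum(xn * cmath.exp(-2j * math.pi * k * n / n_bins)
                for n, xn in enumerate(x)) for k in range(n_bins)]

def crosstalk_filters(h_ll, h_lr, h_rl, h_rr, n_bins=64, min_det=1e-6):
    """Per-bin solution of the four crosstalk equations.

    Returns the frequency responses (G_LL, G_LR, G_RL, G_RR). Raises if the
    acoustic matrix is nearly singular at some bin (no usable solution).
    """
    H_ll, H_lr, H_rl, H_rr = (dft(h, n_bins) for h in (h_ll, h_lr, h_rl, h_rr))
    G_ll, G_lr, G_rl, G_rr = [], [], [], []
    for k in range(n_bins):
        det = H_ll[k] * H_rr[k] - H_rl[k] * H_lr[k]
        if abs(det) < min_det:
            raise ValueError(f"acoustic matrix is near-singular at bin {k}")
        G_ll.append(H_rr[k] / det)
        G_lr.append(-H_rl[k] / det)
        G_rl.append(-H_lr[k] / det)
        G_rr.append(H_ll[k] / det)
    return G_ll, G_lr, G_rl, G_rr

# Toy symmetric setup: unit direct paths, crosstalk delayed and halved.
direct = [1.0]
cross = [0.0, 0.0, 0.0, 0.5]
G_ll, G_lr, G_rl, G_rr = crosstalk_filters(direct, cross, cross, direct)
print(f"|G_LL| at DC = {abs(G_ll[0]):.3f}, |G_RL| at DC = {abs(G_rl[0]):.3f}")
```

Bins where the determinant approaches zero are those where an exact solution does not exist or would demand very large gains; practical designs typically regularize the inversion rather than raise an error.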

The solution to this set of equations is not always guaranteed to exist or 
be unique. Moreover, the solution may also be unstable and/or non-causal. 
However, in recent years some techniques to deal with these issues have proven 
effective in the design of VSS systems. While the crosstalk cancellation concept
is simple, in practice it is not easy to achieve good results. Even assuming that
a correct and well-designed crosstalk network is available, the sweet spot is
typically extremely small and movements of just a few millimeters from the
center result in severe localization errors, especially at higher frequencies (e.g.
f > 3 kHz).

It is well-known that the main limitation of this system is the size of the
sweet spot. Some techniques that try to alleviate this exploit the geometry 
of the setup and place the loudspeakers very close to each other. This stereo 
dipole achieves wider sweet spots but at the cost of acoustic efficiency since the 
cross-talk network filters will have very high gains [32]. Some authors have 
proposed solutions based on monitoring the motion of the head (e.g. using a
video camera) and adapting the TACC according to the relative position changes
between the loudspeakers and the listener [26]. 

Another problem for TACCs is the listening environment. Reverberation, 
especially early reflections from surrounding structures, will add more acoustic
paths than those depicted in the idealized scenario of Fig. 13.9. These additional
paths will not be cancelled by the TACC and will degrade the binaural effect. 

The choice of the HRTF used to model the direct and crosstalk paths
affects the effectiveness of the system. Unfortunately, every listener will
receive the best experience with his or her own HRTF, thus the system will have to
be reconfigured each time it is used by a different listener. Given the difficulty
and expense of measuring HRTFs, the structural modeling approach studied
above (Section 4.2.2) seems attractive for real-world applications where simple
anthropometric input by the user would automatically adjust a set of parametric
HRTF structures and allow correct computation of the network filters.

The TACC solution is then useful for a single listener in a well-controlled
environment. A major breakthrough in this area would considerably increase 
the usability of this technique in practical multi-listener VSS systems. 

5. CONCLUSIONS 

In this chapter, we have given an overview of several aspects relevant to 
the design of VSS systems for tele-collaboration applications. While the re- 
search and theory supporting this technology are well-established, practical 
issues continue to be the main challenge. Among them, we have seen that 
HRTF measurement and modelling have reached very advanced stages. In par- 
ticular, we believe that structural models will play an important role in future 
systems given their low computational requirements and their flexibility to be 
customized to individual listeners. Rendering of binaural audio continues to 
be the main problem. As we discussed, head-tracking for headphone playback 
is necessary to recreate the experience that listeners have in real environments, 
but this technology is expensive. Current technology for free-field rendering 
is acceptable only in very few cases, and it is likely that wavefield synthesis 
techniques, although more expensive, will have more success for multi-user 
applications in the near future.

In spite of its drawbacks, the realism and immersive qualities of binaural
technology will continue to make it the basis for state-of-the-art VSS in
next-generation tele-collaboration systems.

Notes

1. Notice that in the interaural coordinate system for a constant lateral angle and range, the trajectory 
described by the source along the polar angle corresponds to a slice of the cone of confusion. 

2. This approach is also known as Convolvotron due to the name of the first system of this type (developed 
for NASA by Crystal River Engineering).